Improved concurrency for Python

A discussion draft for an enhancement to Python.
John Nagle (nagle@animats.com)
June, 2010 - Version 0.8

Introduction

This is a discussion draft for a novel approach to concurrency in Python. It may in time become a Python Enhancement Proposal. Current Python implementations map inefficiently to multiprocessors. Now that most server CPUs have at least two processors, and may have many more, the time has come to move beyond this.

This is a technical paper about the theory behind "newthreading". For a user-oriented guide, see the User's Guide to "newthreading".

Rationale

The CPython implementation is inherently single-thread. Attempts to move beyond the "Global Interpreter Lock" era have not been notably successful, given the existing semantics of concurrency in Python. Efforts to speed up the execution of single-thread programs (Shed Skin, Unladen Swallow, etc.) have been successful, but the concurrency problem has not been overcome. Given the existing semantics of concurrency in Python, the problem appears insoluble. Thus this proposal for new concurrency semantics.

The model of concurrency proposed here is conventional. Most data is local to a thread. Shared data is protected by monitors, implemented as synchronized classes. Inter-thread message passing via queues is supported. These conditions are sufficient to allow multi-threaded execution without global locking. Python's concept of immutability gives us an edge; we can allow the sharing of immutable data between threads. The widespread use of immutable data in Python programs makes this approach more palatable than it would be in most languages.

Python semantics for single-thread programs do not change. The semantics of multi-thread programs, however, change significantly.

Basic concepts

Python currently implements various "immutable" types: numbers, strings, immutable sets, immutable byte strings, and a few others. Immutable objects could potentially be shared between threads without locking, provided only that some means is provided to recover their storage should they cease to be referenced. This proposal extends immutability by allowing immutable user-defined objects, and by providing mutable, but locking, "synchronized" objects. Immutable and synchronized objects can be shared between threads; other objects cannot.

Synchronized objects

Synchronized objects have the basic property that only one thread at a time can be active within the object. Synchronized objects are created by defining a class which inherits from SynchronizedObject, then instantiating that class.

When a thread enters a synchronized object, it implicitly locks a lock associated with the object. If a thread blocks within a synchronized object, the object is unlocked for the duration of the block.

So far, this is a conventional monitor. Alone, this approach to monitors would not be airtight. A thread could pass a reference to an object into a synchronized object, which could then allow another thread to access the object. This creates the potential for race conditions. It is thus necessary to impose an additional restriction - only immutable objects can be passed in and out of synchronized objects.

This is a tough restriction, and could lead to excessive copying. To make this more convenient, there is a "freezing" mechanism which makes an immutable copy of Python objects. "freezing" is automatically applied to parameters going into methods of synchronized objects, and is applied to return values going out.

Object locks allow re-entry to the object by the same thread, so recursion is permitted.

Access to data fields of an object from outside the object is permitted under the usual Python rules. Such access implies a call to __getattr__() or __setattr__(), either in their built-in forms or as user-defined functions. For a synchronized class, these functions lock and unlock the object's lock, maintaining concurrency safety.

Atomic objects

The base class AtomicObject is provided for simple cases where a traditional monitor is desired. Only one thread at a time can be inside an AtomicObject. An AtomicObject uses a nonrecursive lock, rather than the recursive lock used by SynchronizedObject, and that lock is not unlocked if a thread blocks inside the object. Thus, an AtomicObject is a traditional multiprocessing monitor.

Some restrictions must be imposed to eliminate ambiguity over whether control is “inside” or “outside” the object.

  1. Class member functions and data members beginning with “_” and “__” are not visible outside the object. This is strictly enforced, because those member functions don't do any locking; they rely on the lock set when an externally visible function was called.

  2. Class member functions not beginning with “_” and “__” should not be called from within member functions of the object. Doing so will cause an immediate deadlock, because the object has a non-recursive lock.

  3. Data members not beginning with “_” and “__” should not be accessed from within the member functions of the object. Doing so will cause an immediate deadlock, because the object has a non-recursive lock. The __init__ function is exempt from this restriction, because it runs with the object unlocked. Since the object is only accessible from one thread during construction, this is safe.

  4. The "freezing" and immutable object restrictions of SynchronizedObject apply to AtomicObject.

The locking overhead of an AtomicObject is less than that of a SynchronizedObject. On many machines, locking and unlocking nonrecursive locks is a very cheap operation and does not require a system call. So an AtomicObject should be used for classes that do not need to block. AtomicObject classes are useful for encapsulating shared data with minimal overhead. They can also be used for "pure data" classes, consisting mostly of data members.

As mentioned, atomic objects will deadlock if they call their own externally visible function members. This can happen inadvertently if a function calls outside itself, and some other function calls back into the original object. Thus, atomic objects are most useful when they are self-contained and do little calling of other objects.

Immutable user-defined objects

Immutable user-defined classes can be declared by inheriting from ImmutableObject. Immutable user-defined objects are instances of such classes, and cannot change once __init__() has returned. Outside __init__(), no attribute of the object can be changed. __init__ can be called only once for each object. Immutable classes can inherit only from other immutable classes. Immutable classes cannot have destructors.

"Freezing"

A new function, freeze(obj), returns only shareable objects: ones which are either immutable or synchronized. ("Synchronized" here includes both SynchronizedObject and AtomicObject.)
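The intended mutable-to-immutable mapping can be sketched in plain Python using today's built-ins. This is an illustration, not the proposed implementation; in particular, frozendict does not yet exist, so a sorted tuple of pairs stands in for it here:

```python
def freeze_sketch(obj) :
    """Illustrative approximation of the proposed freeze():
    map each mutable built-in to its immutable counterpart, recursively."""
    if isinstance(obj, (list, tuple)) :
        return tuple(freeze_sketch(item) for item in obj)
    if isinstance(obj, (set, frozenset)) :
        return frozenset(freeze_sketch(item) for item in obj)
    if isinstance(obj, dict) :
        # no built-in frozendict yet; a sorted tuple of pairs stands in
        return tuple(sorted((key, freeze_sketch(value))
                            for key, value in obj.items()))
    if isinstance(obj, bytearray) :
        return bytes(obj)
    return obj      # numbers, strings, etc. are already immutable
```

Note that freezing an already-immutable object returns the object itself, with no copying.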

Table 1 - mutable and immutable Python types

Mutable                  Immutable                  Notes
-                        bool
-                        int
-                        float
-                        long
-                        complex
-                        str
-                        unicode                    Merged into "str" in Python 3
list                     tuple
set                      frozenset
dict                     frozendict                 Currently available in some libraries, but not a built-in
bytearray                bytes
listiterator             -                          Not useful as an immutable object
xrange                   -                          Has internal state; not useful as an immutable
class f(Object)          -                          Mutable user object
-                        class f(ImmutableObject)   Immutable user object

The related predicate isfrozen(obj) indicates whether an object is "frozen". isfrozen(obj) is true iff freeze(obj) == obj.

PEP 351, in 2005, proposed something similar, but with less justification. Python has acquired more built-in types since then, and the new ones come in mutable and immutable forms, so almost all the necessary types already exist. The justification for "freezing" here is that immutable objects can easily be shared between threads.

"frozen" and "immutable" are not the same. "frozen" objects cannot contain mutable unsynchronized objects. A tuple contaning a list is immutable, but not frozen - it is not safe to share betweent threads.

"Freezing" of the program

Startup of a Python program is a dynamic process. "import" is considered executable, and can be executed conditionally. Functions and classes can be modified and patched dynamically. However, when a program has more than one active thread, elaborate locking is required to make this level of dynamism work. As a compromise, this proposal allows all the usual dynamism of Python until the moment the program creates its second thread. This is implemented through freezeprogram().

freezeprogram() may be called explicitly. At the creation of the second thread, the point at which the program goes multi-thread, freezeprogram() is implicitly invoked. From that point on, functions, classes, and modules can no longer be modified.

Global variables

Global variables are allowed, but can contain only frozen types. Global variables can be assigned new (frozen) values.

As an optimization, global variables whose names are all upper case, following the Python convention for constants, are treated as constants and cannot be assigned new values. This allows fast access to constants without locking. Access to other global variables implies setting an implicit lock on the module containing the variable during the access.

Eval

eval is allowed, but only in its multiple-argument form, in which the namespaces of objects which can be accessed are provided explicitly. The arguments to "eval" are "frozen", so that "eval" can only operate on thread-safe data.

Threading

The semantics of the "threading" module do not change. Threads are created and joined as before.

Implications

What does this all mean, and how can it be used?

First, note that the semantics of single-thread programs do not change. This allows for backwards compatibility with most routine Python programs.

This approach supports the two standard models of concurrency - message passing and shared data. Threaded programs which work by passing messages from one thread to another via a queue object are generally compatible with this approach, but will require some modification. Queues can be implemented using synchronized objects.

Shared data is permitted, but the sharing is limited. Synchronized objects communicate with the threads that use them only through data passed by value. It is not permitted to pass a reference to a data structure out of a synchronized object. This is a severe restriction, but it makes threaded programs safe against race conditions.

For most languages, this would be too severe a restriction. Excessive copying would be required. Python has an advantage here - much data is immutable. Immutable objects can be "passed by value" without actually making a copy. Multiple threads can safely share read-only objects without locking. Strings, tuples, and other frozen objects can be passed in and out of synchronized objects without copying overhead. This is what makes the concept work.
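This is ordinary Python behavior today: containers hold references, so "copying" an immutable object into or out of a structure moves no data.

```python
s = "a large shared string"
t = (1, 2, s)           # the tuple stores a reference, not a copy
assert t[2] is s        # same object: "passing" an immutable is free
```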

The restrictions on global variables are designed to retain the dynamic nature of Python during program startup, during which modules can be optionally loaded, parameters and configuration files read, and the program configured as desired. Once the program goes multi-thread, however, the code can no longer change. Globally visible variables are also restricted. This is a minor issue for programs which are strongly object-oriented, but programs which use global variables to excess will need re-working.

Old:

    # Globals
    verbose = True
    loglevel = 0
    debug = True
    flags = {}

    ...

    loglevel = 1
    flags['a'] = True

    ...

    if flags['a'] :
        print("A flag")

New way to code shared globals, with locking:

    class options(SynchronizedObject) :
        """
        Global option variables - thread-safe
        """
        verbose = True
        loglevel = 0
        debug = True
        flags = {}

        def setflag(self, flag, val) :
            """
            Access fn.  Set flag, with automatic locking.
            """
            self.flags[flag] = val

    ...

    options.loglevel = 1          # Simple case
    options.setflag('a', True)    # set in options obj, not copy.

    ...

    if options.flags['a'] :       # copies dict, but works.
        print("A flag")

    # WRONG - will try to update an immutable dictionary
    # and will throw an exception.
    options.flags['a'] = True

The example above shows access to synchronized global variables. Types which are modified by simple replacement need no special handling. Types which are modified in place, such as sequences, need an access function. Note that a likely programmer mistake, trying to modify a field of a synchronized object in place, will throw an exception because the user is accessing an immutable copy of the field. So such errors will not go undetected.

Blocking

In this document, “blocking” means calling an explicit lock's “acquire” method. The “sleep” primitive is also considered a block. Waiting for a lock to enter an object is not “blocking” in this sense.

When a thread in a synchronized object calls an “acquire” primitive and blocks, the object is unlocked. So are all outer synchronized objects currently locked by the blocked thread. When the lock is acquired or the sleep finishes, all the object locks are relocked. This relocking takes place from the outermost object inward, to avoid deadlock.
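The standard library's threading.Condition exhibits the same unlock-while-blocked pattern that the proposal generalizes: wait() releases the underlying lock while the thread is blocked, and re-acquires it before returning. A minimal demonstration with today's threading module:

```python
import threading

cond = threading.Condition()
results = []

def waiter() :
    with cond :
        cond.wait(timeout=1.0)      # the lock is released while waiting
        results.append("woke")

t = threading.Thread(target=waiter)
t.start()
with cond :                          # acquirable only because the waiter released it
    cond.notify()
t.join()
```

The proposal extends this pattern automatically to the entire stack of synchronized objects held by the blocking thread, rather than to a single explicit lock.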

Closures and references to function members

Closures and references to function members cannot be passed between threads. To enforce this, “freeze” will raise an exception if it encounters a closure or a reference to a function member.

The attributes of a function are locked, like global variables, at freezeprogram(). This prevents mutable objects in one thread from becoming visible to another.

Default function parameters for all functions are, like global variables, tested with isfrozen() when freezeprogram() is called. This prevents the following situation:

    def bar(val = []) :
        ...

This usage (which is usually a bug) traditionally results in all calls to “bar” sharing one mutable empty array as the default value of “val”. That would be a violation of the basic concurrency rules. The check at “freezeprogram()” will detect this and raise an exception.
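The shared-default behavior is easy to demonstrate in any current Python:

```python
def bar(val = []) :     # one list object is created when "def" executes
    val.append(1)
    return val

assert bar() == [1]
assert bar() == [1, 1]      # the same list is shared across calls
```

Under the proposed rules, that single shared mutable list would be visible to every thread calling bar, which is why freezeprogram() must reject it.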

Memory model

Conceptually, immutable objects and synchronized objects are shared across threads, while everything else is thread-local. This suggests a memory management scheme in which the objects owned by threads and synchronized objects are reference counted, as at present, while immutable objects reside in a separate heap and are garbage-collected by a concurrent collector.

A possible memory model for an implementation has three memory zones:

  1. Shared fully-immutable objects: primarily strings, numbers, and tuples, all of whose elements are fully immutable.  These can be shared without locking, and reclaimed by a concurrent garbage collector like Boehm's.  They have no destructors, so finalization is not an issue.
  2. Local objects: These are managed as at present, and require no locking.  These can either be thread-local, or local to a synchronized object.  There are no links between local objects under different "ownership".  Whether each thread and object has its own private heap, or whether there's a common heap with locks at the allocator, is an implementation decision.
  3. Shared mutable objects: mostly synchronized objects, but also immutable objects like tuples which contain references to objects that aren't fully immutable.  These are the high-overhead objects, and require locking during reference count updates, or atomic reference count operations if supported by the hardware. The general idea is to minimize the number of objects in this zone.   

The zone of an object is determined when the object is created, and never changes.   This is relatively simple to implement. Tuples (and frozensets, frozendicts, etc.) are normally zone 2 objects.  Only "freeze" creates collections in zones 1 and 3. Synchronized objects are always created in zone 3. There are no difficult handoffs, where an object that was previously thread-local now has to be shared and has to acquire locks during the transition.

Existing interlinked data structures, like parse trees and GUIs, are by default zone 2 objects, with the same semantics as at present.  They can be placed inside a SynchronizedObject if desired, which makes them usable from multiple threads. That's optional; they're thread-local otherwise.

Safety issues

Race conditions

This approach is designed to be "memory safe", protecting the internals of the language, for any Python program. Most current Python implementations have this property, and it should be maintained. This design needs to be carefully checked to ensure that there are no places where references can "leak" from one thread to another.

Deadlock

If a thread is inside more than one synchronized object, a lock is set for each object. Deadlock is thus possible. Synchronized objects may be provided with a timeout option which raises an exception if the object cannot be entered within the time limit.

A more subtle cause of deadlock is the temporary unlocking of an object when a thread blocks inside a synchronized object. When the thread unblocks, it then must re-lock the object. This re-locking operation cannot be allowed to raise a timeout exception, because the catch point of the exception could be within the object. Allowing an object to catch its own lock timeout would result in two threads in the same object at the same time, which is a race condition.

Exceptions raised within a temporary unlock are possible, and the exception unwinder may have to wait for the lock on the synchronized object. It's thus possible for a thread to deadlock during exception processing. Again, a timeout here cannot be allowed. Locking soundness must be maintained even during exception processing.

In general, race conditions are more of a problem than deadlocks. Race conditions happen randomly, depending upon timing. Deadlock problems tend to be repeatable and show up in testing.

Use cases

Example 1
A simple Queue class
class Queue(newthreading.SynchronizedObject) :
    """
    Basic Queue class using new threading system.
    """

    def __init__(self, maxsize = 0) :
        """
        Initialize FIFO queue of specified size.  Size 0 is treated as infinite.
        """
        newthreading.SynchronizedObject.__init__(self)      # init parent
        self.maxsize = maxsize
        self.items = []
        self.lock = newthreading.Semaphore(0)

    def put(self, item) :
        """
        Put item on queue.  A frozen copy of the item will be made if necessary.
        """
        if self.maxsize > 0 and len(self.items) >= self.maxsize :
            raise IndexError("Queue overflow")
        self.items.append(item)
        self.lock.release()

    def get(self) :
        """
        Get item from queue.  Block if no items available.
        """
        self.lock.acquire()     # This waits on the semaphore and unlocks the object.
        return(self.items.pop(0))
      

This is a basic synchronized class. Only one thread can be in this class at a time, except when a thread in "get" is blocked at "self.lock.acquire()". So we don't have to worry about race conditions in accessing the list of items, "self.items". The built-in locking protects the underlying data of Python, with no need for a global lock.

Example 2
A synchronizer for RSS feeds with slow polling
class PollSynchronizer(newthreading.SynchronizedObject) :
    """
    Synchronizer for RSS feed operations which use slow HTTP polling
    """
   
    def __init__(self, timeout = 30.0) :
        """
        Initializer
        """
        newthreading.SynchronizedObject.__init__(self)        # initialize base class
        self.changelock = newthreading.Semaphore(0)
        self.waitcount = 0
        self.changeid = 0
        self.sleepthread = None
        self.timeout = timeout
   
    def setchangeid(self, id) :
        """
        Report that the ID of the feed has changed.  All waiting
        threads are awakened.
        """
        self.changeid = id
        self.__notifyall()                            # wake up everybody
   
    def waitforchange(self, id) :
        """
        Wait until the current feed ID is different from the one given,
        or until the timeout runs out.  Returns (changed, newid), where
        changed is True if the ID has changed, False on timeout with
        no ID change.
        """
        endtime = time.time() + self.timeout                # wake up at this time if no event
        if self.sleepthread is None :
            self.sleepthread = newthreading.Thread(None, self.tick)    # start tick thread
        while self.changeid == id and (self.changeid is not None) and endtime > time.time() :
            self.waitcount += 1
            self.changelock.acquire()
            self.waitcount -= 1
        return((self.changeid != id, self.changeid))
        
    def __notifyall(self) :
        cnt = self.waitcount 
        print("NotifyAll, %d items" % (cnt,))
        for i in xrange(cnt) :
            self.changelock.release()
        
    def tick(self) :
        """
        tick  --  background thread, wakes up everybody every N seconds
        """
        print("Starting tick thread %d" % (thread.get_ident(),))            # startup msg
        while self.changeid is not None :                    # stops when changeid is None
            print("Tick...")
            newthreading.sleep(self.timeout * 0.25)
            self.__notifyall()
        self.__notifyall()                                    # wake all on exit
        
      

This snippet of code is a server-side synchronizer for RSS feeds which use slow HTTP polling. The goal here is to immediately inform the client when data has changed on the server, without overly frequent polling. The client makes HTTP requests which are deliberately stalled by the server for up to 30 seconds if no new items are available, but are answered immediately when new data is available. RSS feeds use an ID (typically a hash of the data, or a timestamp) which changes when the feed data changes.

Each server thread handling a client HTTP request calls waitforchange() with the ID from the last poll and a timeout value in seconds. The call returns either when the timer runs out, or immediately when the data source changes the ID by calling setchangeid().

Note the simplicity of the code. Written with traditional locking primitives, this is a tough piece of code to get right, even with a global lock protecting the underlying implementation. Because the object is synchronized, the obvious code is a solution free of race conditions. Because the synchronized object unlocks when a thread blocks, using a synchronized object does not deadlock the program. As before, the built-in locking eliminates the need for a global lock to protect the underlying data structures.

Example 3
Safe global variables
import newthreading
class GlobalVars(newthreading.AtomicObject) :
    """
    Global variable storage
    """
   
    def __init__(self) :
        """
        Initializer
        """
        self.verbose = False
        self.debug = False
        self.options = []

gvars = GlobalVars()            # Create singleton to hold global vars

...

gvars.verbose = True            # will lock, set, and unlock
gvars.options = ['a', 'd']      # gets turned into a tuple

...

for opt in gvars.options :
    print("Option '%s' is set" % (opt,))
gvars.options = ['e','f']       # OK

...

gvars.options.append('f')       # FAIL - append to tuple.

This is simple thread-safe storage of global variables. Global variables must be immutable or synchronized. Encapsulating global variables in an AtomicObject makes them synchronized. For simple types, variables can be accessed and changed normally. Global variables of mutable types cannot be modified in place from outside the object; that would introduce a race condition.

Note that storing a mutable object, like a list, into one of these global variables actually stores a tuple copy of the list. The assignment

gvars.options = ['e','f']

is treated as

setattr(gvars, "options", freeze(['e','f']))

which stores a "frozen" copy of the list. A frozen list is a tuple. Access to gvars.options, however, does not require a copy. Since the stored value is immutable, access is cheap.

(more examples to be supplied)

Proof of concept implementation

We have developed a proof of concept implementation of this approach. The implementation is a pure Python module which can be imported into CPython. AtomicObject and SynchronizedObject are fully implemented as described. The functions freeze, isfrozen, and freezeprogram are implemented and perform as described. However, due to limitations in CPython's dynamism, not all the restrictions proposed for global variables could be enforced. The implementation will be available shortly on Google Code or SourceForge. We are currently targeting CPython 2.6, but few changes would be required for Python 3.1.

This implementation allows trying out the new threading primitives to find out what it is like to program using them. The locking mechanisms work, and allow easier implementation of threaded programs. There is no performance gain in this implementation, since it's simply built on top of the existing CPython.

Prior work

There have been several attempts to remove CPython's "Global Interpreter Lock" over the years, with increasing urgency as multi-core CPUs have become the standard. The process has been painful. Adam Olsen's Python Safe Threading project achieved a successful implementation, but with a 30% performance penalty on non-threaded programs. This led to abandonment of the project.

We need to avoid the problems which killed that approach.

Olsen ran into a serious performance problem with modifications to classes and modules at run time. He had to introduce a complex data structure called a shareddict to allow modifications to modules and classes during execution. The locking requirements for module and class dictionaries to support self-modifying concurrent code slowed down the implementation excessively.

Our approach to this is to allow dynamic modifications to modules and classes until the program goes multi-thread. We then prohibit them. The key idea is that the use cases for most of Python's code dynamism are during setup and initialization. Few useful programs change code once the program has gone into its heavy parallel processing phase. This suggests a practical compromise.

We allow sharing of synchronized objects and immutable objects between threads. This follows Olsen. We don't currently provide a "shareable dictionary", although AtomicObject could easily be used to encapsulate a dictionary in a way that makes it shareable.

A major issue is the unlocking of classes when control "leaves" a class. Simple monitors are too limited; if a thread blocks in a monitor, the whole monitor is locked up until the thread unblocks. Java's "synchronized classes" ran into this problem. Olsen deals with this by introducing "conditions", a new locking primitive. We stay with the classics, "Semaphore", "sleep", and so forth. But when a thread blocks on one of those primitives within a SynchronizedObject class, the entire stack of SynchronizedObject classes for that thread is automatically unlocked. This maintains the "only one thread at a time active in a class" rule, which is essential to safe memory management. More experience is needed with this approach. It does seem to eliminate the need for most explicit locking in user code. This is a big win, since that's a major source of programming errors in threaded programs.

The memory model for this approach to threading seems to require three categories of memory management. Thread-local objects are reference counted, as at present. Fully immutable objects reside in their own space, are shared, and are recovered by a concurrent garbage collector which does not do finalization. Shared, mutable objects (primarily our synchronized objects) require reference counts protected by per-object locks. The category of an object for memory management purposes is determined at creation and does not change, except for code objects, which become immutable when the program is "frozen". These are all well-understood mechanisms. Only for synchronized objects is there additional performance cost over existing implementations. For immutable objects, the cost should go down.

The Python multiprocessing module tries to graft Python's existing thread-like semantics onto a non-shared environment. The restrictions of the multiprocessing module are somewhat similar to this new threading approach - objects cannot be shared, and queues make copies of enqueued objects. The multiprocessing module is much more restrictive than this threading approach. The threading approach described here should allow conversion of multiprocessing programs into multithreading programs, without loss of safety or performance and with a substantial reduction in memory and cache footprint.

The just-in-time compiler for CPython, "Unladen Swallow", is currently usable only for single-thread programs. The approach here should allow its extension to multi-thread programs. The first call to "freezeprogram()" is the ideal time to compile all code, since it will not change again.

Revisions

PEP data items (future use)

PEP: (to be defined)
Title: Improved concurrency for Python
Version: <svn version string>
Last-Modified: <svn date string>
Author: John Nagle nagle@animats.com
Discussions-To: <email address>
Status: Draft
Type: Informational
Content-Type: <text/plain | text/x-rst>
Requires: <pep numbers>
Created: 01-06-2010
Python-Version: <version number>
Post-History: <dates of postings to python-list and python-dev>
Replaces: <pep number>
Replaced-By: <pep number>
Resolution: <url>