A Quick Test of PyPy-STM

The Software Transactional Memory branch of PyPy has been under development for a while now, and the first binary release was made available earlier this month. (See the announcement.) So I went digging through my collection of Project Euler solutions, looking for a good candidate to test with pypy-stm. Basically, I needed a CPU-bound program that used multiple threads, and didn't depend on C libraries that weren't available in PyPy.

I found one: my solution to Project Euler Problem 215. It used the multiprocessing module to run on multiple CPU cores (without tripping over Python's global interpreter lock and ending up with single-core performance). Since multiprocessing exposes almost the same interface as threading, it took only a few simple tweaks to switch it over, as sketched below.
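
The changes amount to swapping the process-based primitives for their thread-based equivalents (the threaded version also accepts the worker count as an extra argument); roughly, from the two listings below:

# multiprocessing version
import multiprocessing
in_queue = multiprocessing.Queue()
out_queue = multiprocessing.Queue()
proc = multiprocessing.Process(target=count_combos,
                               args=(in_queue, out_queue, compatible_ways))

# threaded version
import threading
import Queue
in_queue = Queue.Queue()
out_queue = Queue.Queue()
proc = threading.Thread(target=count_combos,
                        args=(in_queue, out_queue, compatible_ways))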

Here's the multiprocessing version:

#!/usr/bin/env python

"""Project Euler, problem 215

Consider the problem of building a wall out of 2x1 and 3x1 bricks
(horizontal x vertical dimensions) such that, for extra strength, the gaps
between horizontally adjacent bricks never line up in consecutive layers, i.e.
never form a "running crack".

For example, the following 9x3 wall is not acceptable due to the running
crack shown in red:

3222
2232
333

There are eight ways of forming a crack-free 9x3 wall, written W(9,3) = 8.

Calculate W(32,10).
"""

import sys
import multiprocessing


sys.setrecursionlimit(100)


def gen_ways(width, blocks):
    """Generate all possible permutations of items in blocks that add up
    to width, as strings."""
    if width == 0:
        yield ""
    else:
        for block in blocks:
            if block <= width:
                for way in gen_ways(width - block, blocks):
                    yield str(block) + way


def build_mirrors(ways):
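    """Split rows into mirror-image pairs: 'mirrored' holds one
    representative of each left-right asymmetric row and 'mirrors' holds
    its reflection, so only one of each pair has to be counted directly."""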
    mirrored = set()
    mirrors = set()
    for way in ways:
        rev = "".join(reversed(way))
        if way != rev:
            low, high = sorted([way, rev])
            mirrored.add(low)
            mirrors.add(high)
    return mirrored, mirrors


def find_cracks(way):
    """Return the set of indexes where cracks occur in way"""
    result = set()
    total = 0
    for ch in way[:-1]:
        total += int(ch)
        result.add(total)
    return result


def crack_free(tup1, tup2, cracks):
    """Return True iff tup1 and tup2 can be adjacent without making a
    crack."""
    return not cracks[tup1].intersection(cracks[tup2])


def find_compatible_ways(way, ways, cracks):
    """Return a list of crack-free adjacent ways for way"""
    result = []
    for way2 in ways:
        if crack_free(way, way2, cracks):
            result.append(way2)
    return result


def build_compatible_ways(ways):
    cracks = {}
    for way in ways:
        cracks[way] = find_cracks(way)
    print "done generating %d cracks" % len(cracks)
    compatible_ways = {}
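    # Sentinel entry: with no previous row, every row is acceptable.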
    compatible_ways[()] = ways
    for way in ways:
        compatible_ways[way] = find_compatible_ways(way, ways, cracks)
    return compatible_ways


def gen_combos(height, prev_way, compatible_ways):
    """Generate all ways to make a crack-free wall of size (width, height),
    as height-lists of width-strings."""
    if height == 0:
        return
    elif height == 1:
        for way in compatible_ways[prev_way]:
            yield [way]
    else:
        for way in compatible_ways[prev_way]:
            for combo in gen_combos(height - 1, way, compatible_ways):
                yield [way] + combo


def count_combos(in_queue, out_queue, compatible_ways):
    """Read tuples of (height, prev_way) from in_queue, call gen_combos
    on each, count the number of results, and put the count on out_queue."""
    while True:
        (height, prev_way) = in_queue.get()
        if height is None:
            return
        count = 0
        for combo in gen_combos(height, prev_way, compatible_ways):
            count += 1
        out_queue.put((height, prev_way, count))


def count_combos_memo(in_queue, out_queue, compatible_ways, memo):
    """Read tuples of (height, prev_way) from in_queue, call gen_combos
    on each, chain the result of memo to the last result in the combo
    to get the total count, and put the count on out_queue."""
    while True:
        (height, prev_way) = in_queue.get()
        if height is None:
            return
        count = 0
        for combo in gen_combos(height, prev_way, compatible_ways):
            last = combo[-1]
            count += memo[last]
        out_queue.put((height, prev_way, count))


def W(width, height):
    """Return the number of ways to make a crack-free wall of size
    (width, height)."""
    ways = sorted(gen_ways(width, [2, 3]))
    print "done generating %d ways" % len(ways)

    mirrored, mirrors = build_mirrors(ways)
    print "have %d mirror images " % (len(mirrored))

    compatible_ways = build_compatible_ways(ways)
    print "done generating %d compatible_ways" % sum(map(
        len, compatible_ways.itervalues()))

    in_queue = multiprocessing.Queue()
    out_queue = multiprocessing.Queue()

    cpus = multiprocessing.cpu_count()

    half = (height - 1) // 2
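    # Meet in the middle: this first pass counts, for every candidate row,
    # the crack-free ways to stack `half` more rows on top of it, and
    # memoizes those counts; a second pass below handles the rest.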
    for way in ways:
        if way not in mirrors:
            in_queue.put((half, way))
    # sentinels
    for unused in xrange(cpus):
        in_queue.put((None, None))
    procs = []
    for unused in xrange(cpus):
        proc = multiprocessing.Process(target=count_combos,
                                       args=(in_queue,
                                             out_queue,
                                             compatible_ways))
        proc.daemon = True
        proc.start()
        procs.append(proc)

    half_memo = {}

    num_ways = len(ways) - len(mirrors)
    for ii in xrange(num_ways):
        (unused, prev_way, count) = out_queue.get()
        half_memo[prev_way] = count
        if prev_way in mirrored:
            half_memo["".join(reversed(prev_way))] = count
        print "(%d/%d) %s mirrored=%d count=%d" % (
            ii + 1, num_ways, prev_way, prev_way in mirrored, count)

    for proc in procs:
        proc.join()

    rest = (height - 1) - half
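    # Second pass: enumerate stacks of the remaining `rest` rows above each
    # bottom row and add the memoized count for each stack's topmost row,
    # completing the full `height`-row wall.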

    for way in ways:
        if way not in mirrors:
            in_queue.put((rest, way))
    # sentinels
    for unused in xrange(cpus):
        in_queue.put((None, None))
    procs = []
    for unused in xrange(cpus):
        proc = multiprocessing.Process(target=count_combos_memo, args=(
                                       in_queue, out_queue, compatible_ways,
                                       half_memo))
        proc.daemon = True
        proc.start()
        procs.append(proc)

    total = 0
    for ii in xrange(num_ways):
        (unused, prev_way, count) = out_queue.get()
        if prev_way in mirrored:
            count *= 2
        total += count
        print "(%d/%d) %s mirrored=%d count=%d total=%d" % (
            ii + 1, num_ways, prev_way, prev_way in mirrored, count, total)
    for proc in procs:
        proc.join()
    return total


def main():
    try:
        width = int(sys.argv[1])
    except IndexError:
        width = 32
    try:
        height = int(sys.argv[2])
    except IndexError:
        height = 10
    print W(width, height)


if __name__ == "__main__":
    main()

and here's the threaded version:

#!/usr/bin/env python

"""Project Euler, problem 215

Consider the problem of building a wall out of 2x1 and 3x1 bricks
(horizontal x vertical dimensions) such that, for extra strength, the gaps
between horizontally adjacent bricks never line up in consecutive layers, i.e.
never form a "running crack".

For example, the following 9x3 wall is not acceptable due to the running
crack shown in red:

3222
2232
333

There are eight ways of forming a crack-free 9x3 wall, written W(9,3) = 8.

Calculate W(32,10).
"""

import sys
import multiprocessing
import threading
import Queue


sys.setrecursionlimit(100)


def gen_ways(width, blocks):
    """Generate all possible permutations of items in blocks that add up
    to width, as strings."""
    if width == 0:
        yield ""
    else:
        for block in blocks:
            if block <= width:
                for way in gen_ways(width - block, blocks):
                    yield str(block) + way


def build_mirrors(ways):
    mirrored = set()
    mirrors = set()
    for way in ways:
        rev = "".join(reversed(way))
        if way != rev:
            low, high = sorted([way, rev])
            mirrored.add(low)
            mirrors.add(high)
    return mirrored, mirrors


def find_cracks(way):
    """Return the set of indexes where cracks occur in way"""
    result = set()
    total = 0
    for ch in way[:-1]:
        total += int(ch)
        result.add(total)
    return result


def crack_free(tup1, tup2, cracks):
    """Return True iff tup1 and tup2 can be adjacent without making a
    crack."""
    return not cracks[tup1].intersection(cracks[tup2])


def find_compatible_ways(way, ways, cracks):
    """Return a list of crack-free adjacent ways for way"""
    result = []
    for way2 in ways:
        if crack_free(way, way2, cracks):
            result.append(way2)
    return result


def build_compatible_ways(ways):
    cracks = {}
    for way in ways:
        cracks[way] = find_cracks(way)
    print "done generating %d cracks" % len(cracks)
    compatible_ways = {}
    compatible_ways[()] = ways
    for way in ways:
        compatible_ways[way] = find_compatible_ways(way, ways, cracks)
    return compatible_ways


def gen_combos(height, prev_way, compatible_ways):
    """Generate all ways to make a crack-free wall of size (width, height),
    as height-lists of width-strings."""
    if height == 0:
        return
    elif height == 1:
        for way in compatible_ways[prev_way]:
            yield [way]
    else:
        for way in compatible_ways[prev_way]:
            for combo in gen_combos(height - 1, way, compatible_ways):
                yield [way] + combo


def count_combos(in_queue, out_queue, compatible_ways):
    """Read tuples of (height, prev_way) from in_queue, call gen_combos
    on each, count the number of results, and put the count on out_queue."""
    while True:
        (height, prev_way) = in_queue.get()
        if height is None:
            return
        count = 0
        for combo in gen_combos(height, prev_way, compatible_ways):
            count += 1
        out_queue.put((height, prev_way, count))


def count_combos_memo(in_queue, out_queue, compatible_ways, memo):
    """Read tuples of (height, prev_way) from in_queue, call gen_combos
    on each, chain the result of memo to the last result in the combo
    to get the total count, and put the count on out_queue."""
    while True:
        (height, prev_way) = in_queue.get()
        if height is None:
            return
        count = 0
        for combo in gen_combos(height, prev_way, compatible_ways):
            last = combo[-1]
            count += memo[last]
        out_queue.put((height, prev_way, count))


def W(width, height, cpus=None):
    """Return the number of ways to make a crack-free wall of size
    (width, height)."""
    if cpus is None:
        cpus = multiprocessing.cpu_count()
    ways = sorted(gen_ways(width, [2, 3]))
    print "done generating %d ways" % len(ways)

    mirrored, mirrors = build_mirrors(ways)
    print "have %d mirror images " % (len(mirrored))

    compatible_ways = build_compatible_ways(ways)
    print "done generating %d compatible_ways" % sum(map(
        len, compatible_ways.itervalues()))

    in_queue = Queue.Queue()
    out_queue = Queue.Queue()

    half = (height - 1) // 2
    for way in ways:
        if way not in mirrors:
            in_queue.put((half, way))
    # sentinels
    for unused in xrange(cpus):
        in_queue.put((None, None))
    procs = []
    for unused in xrange(cpus):
        proc = threading.Thread(target=count_combos, args=(in_queue,
                                out_queue, compatible_ways))
        proc.daemon = True
        proc.start()
        procs.append(proc)

    half_memo = {}

    num_ways = len(ways) - len(mirrors)
    for ii in xrange(num_ways):
        (unused, prev_way, count) = out_queue.get()
        half_memo[prev_way] = count
        if prev_way in mirrored:
            half_memo["".join(reversed(prev_way))] = count
        print "(%d/%d) %s mirrored=%d count=%d" % (
            ii + 1, num_ways, prev_way, prev_way in mirrored, count)

    for proc in procs:
        proc.join()

    rest = (height - 1) - half

    for way in ways:
        if way not in mirrors:
            in_queue.put((rest, way))
    # sentinels
    for unused in xrange(cpus):
        in_queue.put((None, None))
    procs = []
    for unused in xrange(cpus):
        proc = threading.Thread(target=count_combos_memo, args=(
                                in_queue, out_queue, compatible_ways,
                                half_memo))
        proc.daemon = True
        proc.start()
        procs.append(proc)

    total = 0
    for ii in xrange(num_ways):
        (unused, prev_way, count) = out_queue.get()
        if prev_way in mirrored:
            count *= 2
        total += count
        print "(%d/%d) %s mirrored=%d count=%d total=%d" % (
            ii + 1, num_ways, prev_way, prev_way in mirrored, count, total)
    for proc in procs:
        proc.join()
    return total


def main():
    try:
        width = int(sys.argv[1])
    except IndexError:
        width = 32
    try:
        height = int(sys.argv[2])
    except IndexError:
        height = 10
    try:
        cpus = int(sys.argv[3])
    except IndexError:
        cpus = multiprocessing.cpu_count()
    print W(width, height, cpus)


if __name__ == "__main__":
    main()

Anyway, on my old 3.0 GHz quad-core Phenom II CPU, under Kubuntu Linux 14.04, the multiprocessing version (at the full 32x10 problem size) performs like this:

Python            Elapsed time (mm:ss)
CPython 2.7.6     21:13
PyPy 2.3.1        7:06
pypy-stm 2.3r2    11:27

So, basically, PyPy is almost 3 times as fast as CPython, and pypy-stm adds enough overhead to be about 60% slower than PyPy. (Note that since this version already uses multiprocessing to run on multiple CPUs, STM is pretty much all overhead and no benefit here.)

The more interesting result is for the threaded version. On CPython and vanilla PyPy we'd expect it to be about 4 times as slow as the multiprocessing version, since the GIL confines the threads to a single core of the quad-core CPU; on pypy-stm we'd hope for less than a 4x slowdown, showing transactional memory letting single-process, multi-threaded code avoid the GIL.

Unfortunately, I guess this program uses a bit too much memory for the current version of pypy-stm, as it consistently segfaults about 3 minutes in. Armin's blog post (above) warned that this might happen due to a bug in LLVM. Oh well.

Fortunately, my program takes command-line arguments that vary the size of the problem. So, instead of building a wall that's 32 units wide and 10 units high, let's build one that's only 18 units wide and 8 units high. (I picked those values experimentally, looking for the biggest numbers that let the program complete successfully many times in a row on pypy-stm.)
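
For example, with the two scripts saved as euler215_multiprocessing.py and euler215_threads.py (placeholder names, not the originals) and a pypy-stm binary on the PATH, the smaller runs look roughly like this; substitute python or pypy to compare interpreters:

pypy-stm euler215_multiprocessing.py 18 8    # arguments: width height
pypy-stm euler215_threads.py 18 8            # arguments: width height [num_threads]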

First, the multiprocessing version:

Python            Elapsed time (seconds)
CPython 2.7.6     0.052
PyPy 2.3.1        0.167
pypy-stm 2.3r2    3.28

So, due to JIT startup overhead, CPython is actually faster than PyPy here, and pypy-stm's overhead looks really bad at this problem size: roughly 20 times slower than vanilla PyPy.

Then, the more interesting result, the threaded version:

Python            Elapsed time (seconds)
CPython 2.7.6     0.044
PyPy 2.3.1        0.176
pypy-stm 2.3r2    1.21

Threads are slightly faster than multiprocessing on CPython and roughly the same speed on vanilla PyPy, I guess because the overhead of forking subprocesses cancels out the benefit of using multiple CPUs at such a small problem size. Note that pypy-stm is significantly faster on the threaded version than on the multiprocessing version, at least on its best run. (All results shown are the best of 3 runs.)

One interesting thing is that pypy-stm's performance was highly variable: over the 3 runs, the threaded version took anywhere from 1.21s to over 3s. It appears there's an element of luck in whether the transactions collide, and when they do, the code takes longer to finish (but still gives the correct result).

In conclusion, pypy-stm is still highly experimental, will crash if you give it a program that uses too much memory, and isn't yet fast enough to be helpful in this benchmark. Even though Python programmers love to whine about the GIL, the multiprocessing module already gives a pretty nice way around it, so the bar for pypy-stm to be practical in production (as opposed to a nice theoretical result) is pretty high. Still, seeing transactional memory work at all to parallelize threads in Python is very impressive. (Haskell has had a nice STM implementation for a while, but then Haskell doesn't allow side effects in most functions, so it's a lot easier to parallelize than a language like Python.)

I'll be donating to the next phase of the pypy-stm effort, and looking forward to better results in the future.