Threads and GCs
Hi all,
We can now compile a pypy-c that includes both thread support and one of our semi-advanced garbage collectors. This means that threaded Python programs can now run not only with better performance, but also without the annoyances of the Boehm garbage collector. (For example, Boehm doesn't much like seeing large numbers of __del__() methods, and our implementation of ctypes uses them everywhere.)
Magic translation command (example):
translate.py --thread --gc=hybrid targetpypystandalone --faassen --allworkingmodules
Note that multithreading in PyPy is based on a global interpreter lock, as in CPython. I imagine that we will get rid of the global interpreter lock at some point in the future -- I can certainly see how this might be done in PyPy, unlike in CPython -- but it will be a lot of work nevertheless. Given our current priorities, it will probably not occur soon unless someone steps in.
Progress on the CLI JIT backend front
In the last few months, I've been actively working on the CLI backend for PyPy's JIT generator, whose goal is to automatically generate JIT compilers that produce .NET bytecode on the fly.
The CLI JIT backend is far from complete and there is still a lot of work to be done before it can handle PyPy's full Python interpreter; nevertheless, yesterday I finally got the first .NET executable that contains a JIT for a very simple toy language called tlr, which implements an interpreter for a minimal register-based virtual machine with only 8 operations.
To compile the tlr VM, follow these steps:
get a fresh checkout of the oo-jit branch, i.e. the branch where the CLI JIT development goes on:
$ svn co https://codespeak.net/svn/pypy/branch/oo-jit
go to the oo-jit/pypy/jit/tl directory, and compile the tlr VM with the CLI backend and JIT enabled:
$ cd oo-jit/pypy/jit/tl/
$ ../../translator/goal/translate.py -b cli --jit --batch targettlr
The goal of our test program is to compute the square of a given number; since the only operations supported by the VM are addition and negation, we compute the result by doing repetitive additions; I won't describe the exact meaning of all the tlr bytecodes here, as they are quite self-documenting:
ALLOCATE, 3,     # make space for three registers
MOV_A_R, 0,      # i = a
MOV_A_R, 1,      # copy of 'a'
SET_A, 0,
MOV_A_R, 2,      # res = 0
# 10:
SET_A, 1,
NEG_A,
ADD_R_TO_A, 0,
MOV_A_R, 0,      # i--
MOV_R_A, 2,
ADD_R_TO_A, 1,
MOV_A_R, 2,      # res += a
MOV_R_A, 0,
JUMP_IF_A, 10,   # if i!=0: goto 10
MOV_R_A, 2,
RETURN_A         # return res
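To trace the program by hand, the following is a minimal Python sketch of an interpreter for these eight opcodes. It is not the code from the tlr module: the opcode semantics are inferred from the names and comments in the listing, opcodes are written as strings instead of numeric bytecodes, and the names `interpret` and `SQUARE` are made up for this illustration.

```python
# Illustrative sketch only: semantics inferred from the listing above,
# not copied from the real tlr module.

SQUARE = [
    "ALLOCATE", 3,      # make space for three registers
    "MOV_A_R", 0,       # i = a
    "MOV_A_R", 1,       # copy of 'a'
    "SET_A", 0,
    "MOV_A_R", 2,       # res = 0
    # 10:
    "SET_A", 1,
    "NEG_A",
    "ADD_R_TO_A", 0,
    "MOV_A_R", 0,       # i--
    "MOV_R_A", 2,
    "ADD_R_TO_A", 1,
    "MOV_A_R", 2,       # res += a
    "MOV_R_A", 0,
    "JUMP_IF_A", 10,    # if i != 0: goto 10
    "MOV_R_A", 2,
    "RETURN_A",         # return res
]


def interpret(code, a):
    """Run `code` with `a` as the initial value of the accumulator."""
    regs = []
    pc = 0
    while True:
        op = code[pc]
        pc += 1
        # The two opcodes without an operand are handled first.
        if op == "RETURN_A":
            return a
        if op == "NEG_A":
            a = -a
            continue
        # Every remaining opcode takes exactly one operand.
        arg = code[pc]
        pc += 1
        if op == "ALLOCATE":
            regs = [0] * arg
        elif op == "MOV_A_R":
            regs[arg] = a
        elif op == "MOV_R_A":
            a = regs[arg]
        elif op == "SET_A":
            a = arg
        elif op == "ADD_R_TO_A":
            a += regs[arg]
        elif op == "JUMP_IF_A":
            if a:
                pc = arg
        else:
            raise ValueError("unknown opcode: %r" % (op,))
```

With this sketch, `interpret(SQUARE, 16)` returns 256. The input has to be a positive integer: starting from 0, the counter goes negative and the loop in this sketch never terminates.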
You can find the program also at the end of the tlr module; to get an assembled version of the bytecode, ready to be interpreted, run this command:
$ python tlr.py assemble > square.tlr
Now, we are ready to execute the code through the tlr VM; if you are using Linux/Mono, you can simply execute the targettlr-cli script that has been created for you; however, if you use Windows, you have to manually fish the executable inside the targettlr-cli-data directory:
# Linux
$ ./targettlr-cli square.tlr 16
256
# Windows
> targettlr-cli-data\main.exe square.tlr 16
256
Cool, our program computed the result correctly! But how can we be sure that it really JIT compiled our code instead of interpreting it? To inspect the code that is generated by our JIT compiler, we simply set the PYPYJITLOG environment variable to a filename, so that the JIT will create a .NET assembly containing all the code that has been generated by the JIT:
$ PYPYJITLOG=generated.dll ./targettlr-cli square.tlr 16
256
$ file generated.dll
generated.dll: MS-DOS executable PE for MS Windows (DLL) (console) Intel 80386 32-bit
Now, we can inspect the DLL with any IL disassembler, such as ildasm or monodis; here is an excerpt of the disassembled code that shows how our square.tlr bytecode has been compiled to .NET bytecode:
.method public static hidebysig default int32 invoke (object[] A_0, int32 A_1) cil managed
{
    .maxstack 3
    .locals init (int32 V_0, int32 V_1, int32 V_2, int32 V_3, int32 V_4, int32 V_5)
    ldc.i4 -1
    ldarg.1
    add
    stloc.1
    ldc.i4 0
    ldarg.1
    add
    stloc.2
    IL_0010: ldloc.1
    ldc.i4.0
    cgt.un
    stloc.3
    ldloc.3
    brfalse IL_003b
    ldc.i4 -1
    ldloc.1
    add
    stloc.s 4
    ldloc.2
    ldarg.1
    add
    stloc.s 5
    ldloc.s 5
    stloc.2
    ldloc.s 4
    stloc.1
    ldarg.1
    starg 1
    nop
    nop
    br IL_0010
    IL_003b: ldloc.2
    stloc.0
    br IL_0042
    ldloc.0
    ret
}
If you know a bit of IL, you can see that the generated code is not optimal, as there are some redundant operations like all those stloc/ldloc pairs; however, while not optimal, it is still quite good code, not much different from what you would get by writing the square algorithm directly in e.g. C#.
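For readers who do not read IL, here is a hand-written Python transcription of the method above. It is only a reading aid: the function name `invoke` and the variable names mirror the IL locals, `n` stands for the int32 argument A_1, and the unsigned comparison `cgt.un` against zero is rendered as `!= 0`, which is equivalent for a 32-bit value.

```python
def invoke(n):
    """Illustrative transcription of the disassembled method; n is A_1."""
    v1 = -1 + n          # ldc.i4 -1; ldarg.1; add; stloc.1
    v2 = 0 + n           # ldc.i4 0;  ldarg.1; add; stloc.2
    while v1 != 0:       # IL_0010: cgt.un against 0, brfalse IL_003b
        v4 = -1 + v1     # ldc.i4 -1; ldloc.1; add; stloc.s 4
        v5 = v2 + n      # ldloc.2;   ldarg.1; add; stloc.s 5
        v2 = v5          # ldloc.s 5; stloc.2
        v1 = v4          # ldloc.s 4; stloc.1
    return v2            # IL_003b: ldloc.2; stloc.0; ...; ldloc.0; ret
```

Read this way, the bytecode dispatch of the interpreter is gone entirely: what remains is a counter starting at n - 1 and a running sum starting at n, i.e. the first pass of the tlr loop has been folded into the initialisation. As with the original program, the transcription assumes a positive input.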
As I said before, all of this is still work in progress and there is still much to be done. Stay tuned :-).
So the mono JIT would pick up that bytecode and further compile it to native code?
Also, what would be needed for doing the same thing for the JVM?
Yes, that's exactly the idea; in fact, the programs run by virtual machines generated this way are double JIT-ed.
Doing the same for the JVM won't be too hard, since most of the work we've done can be shared between the two JIT backends; unfortunately, at the moment the JVM backend is not as advanced as the CLI one, so before working on the JIT we would need more work on it. But indeed, having a JIT backend for the JVM is in our plans.
More windows support
Recently, thanks to Amaury Forgeot d'Arc and Michael Schneider, Windows became more of a first-class platform for PyPy's Python interpreter. Most RPython extension modules are now considered working (apart from some POSIX-specific modules). Even ctypes now works on Windows!
The next step would be to have better buildbot support for all supported platforms (Windows, Linux and OS X), so we can detect and react to regressions quickly. (Buildbot is maintained by JP Calderone.)
Cheers,
fijal
S3-Workshop Potsdam 2008 Writeup
Here are some notes about the S3 Workshop in Potsdam that several PyPyers and Spies (Armin, Carl Friedrich, Niko, Toon, Adrian) attended before the Berlin sprint. We presented a paper about SPy there. Below are some mostly random notes about my (Carl Friedrich's) impressions of the conference and some talk notes. Before that I'd like to give thanks to the organizers, who did a great job. The workshop was well organized and the social events were wonderful (a very relaxing boat trip on the many lakes around Potsdam and a conference dinner).
Video recordings of all the talks can be found on the program page.
Invited Talks
"Late-bound Object Lambda Architectures" by Ian Piumarta was quite an inspiring talk about VPRI's attempt at writing a flexible and understandable computing system in 20K lines of code. The talk was lacking a bit in technical details, so while it was inspiring I couldn't really say much about their implementation. Apart from that, I disagree with some of their goals, but that's the topic of another blog post.
"The Lively Kernel – A Self-supporting System on a Web Page" by Dan Ingalls. Dan Ingalls is one of the inventors of the original Smalltalk and of Squeak. He was talking about his latest work, the attempts of bringing a Squeak-like system to a web browser using JavaScript and SVG. To get some feel for what exactly The Lively Kernel is, it is easiest to just try it out (only works in Safari and Firefox 3 above Beta 5 though). I guess in a sense the progress of the Lively Kernel over Squeak is not that great but Dan seems to be having fun. Dan is an incredibly enthusiastic, friendly and positive person, it was really great meeting him. He even seemed to like some of the ideas in SPy.
"On Sustaining Self" by Richard P. Gabriel was a sort of deconstructivist multi-media-show train wreck of a presentation that was a bit too weird for my taste. There was a lot of music, and there were sections in the presentation where Richard discussed with an alter ego, whose part he had recorded in advance and mangled with a sound editor. There was a large bit of a documentary about Levittown. Even the introduction and the questions were weird, with Pascal Costanza staring down the audience without saying a word (nobody dared to ask questions). I am not sure I saw the point of the presentation, apart from getting the audience to think, which probably worked. It seems that there are people (e.g. Christian Neukirchen) who liked the presentation, though.
Research Papers
"SBCL - A Sanely Bootstrappable Common Lisp" by Christophe Rhodes described the bootstrapping process of SBCL (Steel Bank Common Lisp). SBCL can be bootstrapped by a variety of Common Lisps, not just by itself. SBCL contains a complete blueprint of the initial image instead of always getting the new image by carefully mutating the old one. This bootstrapping approach is sort of similar to that of PyPy.
"Reflection for the Masses" by Charlotte Herzeel, Pascal Costanza, and Theo D'Hondt retraced some of the work of Brian Smith on reflection in Lisp. The talk was not very good: it was way too long (40 min) and quite hard to understand because Charlotte Herzeel was talking in a very low voice. The biggest mistake in her talk was, in my opinion, that she spent too much time explaining a more or less standard meta-circular interpreter for Lisp and then ran out of time when she was trying to explain the modifications. I guess it would have been a fair assumption that large parts of the audience know such interpreters, so glossing over the details would have been fine. A bit of a pity, since the paper seems interesting.
"Back to the Future in One Week - Implementing a Smalltalk VM in PyPy" by Carl Friedrich Bolz, Adrian Kuhn, Adrian Lienhard, Nicholas D. Matsakis, Oscar Nierstrasz, Lukas Renggli, Armin Rigo and Toon Verwaest, the paper with the longest author list. We just made everybody an author who was at the sprint in Bern. Our paper had more authors than all the other papers together :-). I gave the presentation at the workshop, which went quite well, judging from the feedback I got.
"Huemul - A Smalltalk Implementation" by Guillermo Adrián Molina. Huemul is a Smalltalk implementation that doesn't contain an interpreter but directly compiles all methods to assembler (and also saves the assembler in the image). In addition, as much functionality (such as threading, GUI) as possible is delegated to libraries instead of reimplementing them in Smalltalk (as e.g. Squeak is doing). The approach seems to suffer from the usual problems of manually writing a JIT, e.g. the VM seems to segfault pretty often. Also I don't agree with some of the design decisions of the threading scheme, there is no automatic locking of objects at all, instead the user code is responsible for preventing concurrent accesses from messing up things (which even seems to lead to segfaults in the default image).
"Are Bytecodes an Atavism?" by Theo D'Hondt argued that AST-based interpreters can be as fast as bytecode-based interpreters, which he demonstrated by writing two AST interpreters, one for Pico and one for Scheme. Both of these implementations seem to perform pretty well. Theo seems to hold many views similar to PyPy's, for example that writing simple, straightforward interpreters is often preferable to writing complex (JIT-)compilers.
Berlin Sprint Finished
The Berlin sprint is finished; below are some notes on what we worked on during the last three days:
- Camillo worked tirelessly on the gameboy emulator with some occasional input by various people. He is making good progress, some test ROMs run now on the translated emulator. However, the graphics are still not completely working for unclear reasons. Since PyBoy is already taken as a project name, we considered calling it PyGirl (another name proposition was "BoyBoy", but the implementation is not circular enough for that).
- On Monday Armin and Samuele fixed the problem with our multimethods so that the builtin shortcut works again (the builtin shortcut is an optimization that speeds up all operations on builtin non-subclassed types quite a bit).
- Antonio and Holger (who hasn't been on a sprint in a while, great to have you back!) worked on writing a conftest file (the plugin mechanism of py.test) that would allow us to run Django tests using py.test, which seems to be not completely trivial. They also fixed some bugs in PyPy's Python interpreter, e.g. related to dictionary subclassing.
- Karl started adding sound support to the RPython SDL-bindings, which will be needed both by the Gameboy emulator and eventually by the SPy VM.
- Armin and Maciek continued the work that Maciek had started a while ago of improving the speed of PyPy's IO operations. In the past, doing IO usually involved copying lots of memory around, which should have improved now. Armin and Maciek improved and then merged the first of the two branches that contained IO improvements, which speeds up IO on non-moving GCs (mostly the Boehm GC). Then they continued working on the hybrid-io branch, which is supposed to improve IO on the hybrid GC (which was partially designed exactly for this).
- Toon and Carl Friedrich finished cleaning up the SPy improvement branch and fixed all warnings that occur when you translate SPy there. An obscure bug in an optimization prevented them from getting working executables, which at this moment blocks the merging of that branch.
By now everybody is home again (except for Anto, who booked his return flight two days too late, accidentally) and mostly resting. It was a good sprint, with some interesting results and several new people joining. And it was definitely the most unusual sprint location ever :-).
Berlin Sprint Day 1 + 2
After having survived the S3-Workshop which took place in Potsdam on Thursday and Friday (a blog-post about this will follow later) we are now sitting in the c-base in Berlin, happily sprinting. Below are some notes on what progress we made so far:
- The Gameboy emulator in RPython that Camillo Bruni is working on for his Bachelor project at Uni Bern does now translate. It took him (assisted by various people) a while to figure out the translation errors (essentially because he wrote nice Python code that passed bound methods around, which the RTyper doesn't completely like). Now that is fixed and the Gameboy emulator translates and runs a test ROM. You cannot really see anything yet, because there is no graphics support in RPython.
- To get graphics support in RPython Armin and Karl started writing SDL bindings for RPython, which both the Gameboy emulator and the SPy VM need. They have basic stuff working, probably enough to support the Gameboy already.
- Alexander, Armin, Maciek and Samuele discussed how to approach separate compilation for RPython, which isn't easy because the RPython type analysis is a whole-program analysis.
- Stephan, Peter and Adrian (at least in the beginning) worked on making PyPy's stackless module more complete. They added channel preferences which change details of the scheduling semantics.
- Toon, Carl Friedrich and Adrian (a tiny bit) worked on SPy. There is a branch that Toon started a while ago which contains many improvements but is also quite unclear in many respects. There was some progress in cleaning that up. This involved implementing the Smalltalk process scheduler (Smalltalk really is an OS). There is still quite some work left though. While doing so, we discovered many funny facts about Squeak's implementation details (most of which are exposed to the user). I guess we should collect them and blog about them eventually.
- Samuele and Maciek improved the ctypes version of pysqlite that Gerhard Häring started.
- Armin, Samuele and Maciek found an obscure bug in the interaction between the builtin-type-shortcut that Armin recently implemented and our multimethod implementation. It's not clear which of the two is to blame, and it seems rather unclear how to fix the problem: Armin and Samuele have been stuck in a discussion about how to approach a solution for a while and are hard to talk to.
- Stijn Timbermont, a Ph.D. student at the Vrije Universiteit Brussel who is visiting the sprint for two days, was first looking at how our GCs are implemented to figure out whether he can use PyPy for some experiments. The answer to that seems to be no. Today he was hacking on a Pico interpreter (without knowing too much about Python) and is making some nice progress, it seems.
Will try to blog more as the sprint progresses.
General performance improvements
Hi all,
During the past two weeks we invested some more effort in the baseline performance of pypy-c. Some of the tweaks we did were just new ideas, and others were based on actual profiling. The net outcome is that we now expect PyPy to be in the worst case twice as slow as CPython on real applications. Here are some small-to-medium-size benchmark results. The number is the execution time, normalized to 1.0 for CPython 2.4:
- 1.90 on templess (a simple templating language)
- 1.49 on gadfly (pure Python SQL database)
- 1.49 on translate.py (pypy's own translation toolchain)
- 1.44 on mako (another templating system)
- 1.21 on pystone
- 0.78 on richards
(This is all without the JIT, as usual. The JIT is not ready yet.)
You can build yourself a pypy-c with this kind of speed with the magic command line (gcrootfinder is only for a 32-bit Linux machine):
pypy/translator/goal/translate.py --gc=hybrid --gcrootfinder=asmgcc targetpypystandalone --allworkingmodules --faassen
The main improvements come from:
- A general shortcut for any operation between built-in objects: for example, a subtraction of two integers or floats now dispatches directly to the integer or float subtraction code, without looking up the '__sub__' in the class.
- A shortcut for getting attributes out of instances of user classes when the '__getattribute__' special method is not overridden.
- The so-called Hybrid Garbage Collector is now a three-generations collector. More about our GCs...
- Some profiling showed bad performance in our implementation of the built-in id() -- a trivial function to write in CPython, but a lot more fun when you have a moving GC and your object's real address can change.
- The bytecode compiler's parser had a very slow linear search algorithm that we replaced with a dictionary lookup.
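The first item in the list, the shortcut for operations between built-in objects, can be pictured with a small Python sketch. This is only an illustration of the idea, not PyPy's multimethod code: the function names `sub` and `generic_sub` are invented here, and the slow path is a simplified version of Python's binary operator protocol.

```python
def generic_sub(a, b):
    """Slow path: look up the special methods on the classes."""
    meth = getattr(type(a), "__sub__", None)
    if meth is not None:
        result = meth(a, b)
        if result is not NotImplemented:
            return result
    # Fall back to the reflected method of the right operand.
    rmeth = getattr(type(b), "__rsub__", None)
    if rmeth is not None:
        result = rmeth(b, a)
        if result is not NotImplemented:
            return result
    raise TypeError("unsupported operand types for -: %r and %r"
                    % (type(a).__name__, type(b).__name__))


def sub(a, b):
    """Subtraction with a shortcut for exact built-in types."""
    # Fast path: both operands are exactly int (or exactly float), so
    # no subclass can have overridden '__sub__' and the class lookup
    # can be skipped.
    if type(a) is int and type(b) is int:
        return int.__sub__(a, b)
    if type(a) is float and type(b) is float:
        return float.__sub__(a, b)
    return generic_sub(a, b)
```

The checks use `type(x) is int` rather than `isinstance`, mirroring the restriction to non-subclassed built-in types: an instance of a user subclass may define its own '__sub__' and must go through the generic lookup.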
These benchmarks are doing CPU-intensive operations. You can expect a similar blog post soon about the I/O performance, as the io-improvements branch gets closer to being merged :-) The branch could also improve the speed of string operations, as used e.g. by the templating systems.
We had the same problem with id() (called object_id()) in Rubinius. We currently hide an object's ID inside its metaclass (allocating one if there isn't one).
Where did you guys store it?
The ID is stored in a special dictionary (a normal dictionary specialized to be allocated so that the GC won't see it) that is used in the GC as a mapping from addresses to integers. This dict is updated when necessary (usually when collecting).
The dictionary is of course only filled for objects that were used in an id() call.
There are actually two dictionaries, at least when using one of the generational GCs: one for the first generation objects and one for the rest. The dictionary for the rest of the objects can probably get quite large, but it needs to be traversed once during each full collection only. It seems that full collections are rare enough: the full dictionary updating doesn't stand out in profiled runs.
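As a rough illustration of the address-to-integer mapping described above, here is a toy Python model of a moving heap. Everything in it is hypothetical (the class `ToyMovingHeap`, its methods, the fake addresses); it uses a single dictionary instead of one per generation and only shows how the table is filled lazily and re-keyed when a collection moves objects.

```python
import itertools


class ToyMovingHeap:
    """Toy model: objects live at integer 'addresses' that change on collect()."""

    def __init__(self):
        self.objects = {}     # address -> payload
        self.id_table = {}    # address -> id, only for objects passed to object_id()
        self._next_addr = 1000
        self._ids = itertools.count(1)

    def _fresh_address(self):
        addr = self._next_addr
        self._next_addr += 8
        return addr

    def allocate(self, payload):
        addr = self._fresh_address()
        self.objects[addr] = payload
        return addr

    def object_id(self, addr):
        # The table is filled lazily: an entry is created on first request.
        if addr not in self.id_table:
            self.id_table[addr] = next(self._ids)
        return self.id_table[addr]

    def collect(self, live):
        """Move the live objects and return the old -> new address map."""
        forwarding = {}
        moved = {}
        for old in live:
            new = self._fresh_address()
            moved[new] = self.objects[old]
            forwarding[old] = new
        self.objects = moved
        # Re-key the id table for survivors; entries of dead objects vanish.
        self.id_table = {forwarding[old]: ident
                         for old, ident in self.id_table.items()
                         if old in forwarding}
        return forwarding
```

After `fwd = heap.collect([a])`, the object that used to live at address `a` keeps the id it was given before the move: `heap.object_id(fwd[a])` returns the same integer as the earlier `heap.object_id(a)`.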
I didn't think about implementing id() at the language level, e.g. by extending the class of the object to add a field.
We can't really do that in RPython. Moreover, that seems impractical for Python: if someone asks for the id() of an integer object, do all integers suddenly need to grow an 'id' field?
Great work!
I have a few questions not answered by the FAQ that I hope someone will be able to answer.
When might the JIT be ready enough? (no stress, just asking :)
How much faster are CPython 2.5, 2.6 and 3.0? That seems to be relevant to the statement "we now expect PyPy to be in the worst case twice as slow as CPython".
If I understand correctly, one of the purposes of PyPy is to make experimentation easier - so will making it compatible with 3.0 be fairly easy? Are there plans to do so?
Is PyPy expected to one day become a serious "competitor" to CPython, in that you might want to run it in production? Is there a time set for when it will be ready for use by the general public (i.e me ;)?
So, answering questions one by one:
JIT will be ready when it'll be ready, not earlier.
CPython 2.5 is slightly faster for some operations. No real difference there. 2.6 was optimized for certain operations but, again, don't expect a huge difference. I think you can expect pypy to be within 2x of any CPython. It is not even clear yet what 3.0 will look like, but being ultra fast is certainly not its primary goal.
Regarding making pypy compatible with 3.0 - yes, that should be fairly easy, although we don't have any immediate plans to do that.
The final date for making pypy production ready is not set (and this is a gradual process), but as you can see here and here we're trying more and more to make it run existing applications.
Cheers,
fijal
Note that current benchmarks suggest that CPython 3.0 is still much slower than CPython 2.x. It might be interesting to see whether this means that PyPy is much faster than CPython 3.0 running e.g. Pystone.
Of course this fact would not be very surprising, esp. given that PyPy does not implement any CPy3k features.
"JIT will be ready when it'll be ready, not earlier."
Alright, alright... we know.
But could you at least give us a very rough estimation for us, mere mortals? What does your heart tell you? :-)
Next Sprint: Berlin, May 17-22nd
Our next PyPy sprint will take place in the crashed c-base space station, Berlin, Germany, Earth, Solar System. This is a fully public sprint: newcomers (from all planets) are welcome. Suggested topics (other topics are welcome too):
- work on PyPy's JIT generator: we are refactoring parts of the compiling logic, in ways that may also allow generating better machine code for loops (people or aliens with knowledge on compilers and SSA, welcome)
- work on the SPy VM, PyPy's Squeak implementation, particularly the graphics capabilities
- work on PyPy's GameBoy emulator, which also needs graphics support
- trying some large pure-Python applications or libraries on PyPy and fixing the resulting bugs. Possibilities are Zope 3, Django and others.
For more information, see the full announcement.
Google's Summer of Code
PyPy got one proposal accepted for Google's Summer of Code under the Python Software Foundation's umbrella. We welcome Bruno Gola into the PyPy community. He will work on supporting all Python 2.5 features in PyPy and will also update PyPy's standard library to support the modules that were modified or new in Python 2.5.
Right now PyPy supports only Python 2.4 fully (some Python 2.5 features have already sneaked in, though).
Float operations for JIT
Recently, we taught the JIT x86 backend how to produce code for the x87 floating point coprocessor. This means that the JIT is able to nicely speed up float operations (this is not true for our Python interpreter yet, as we have not integrated it yet). This is the first time we have gone beyond what is feasible in psyco: it would take a lot of effort to make floats work on top of psyco, far more than it will take on PyPy.
This work is at a very early stage and lives on the jit-hotpath branch, which includes all our recent experiments on JIT compiler generation, including tracing JIT experiments and a huge JIT refactoring.
Because we don't encode Python's semantics in our JIT (which is really a JIT generator), it is expected that our Python interpreter with a JIT will become fast "suddenly", when our JIT generator is good enough. If this point is reached, we would also get fast interpreters for Smalltalk or JavaScript with relatively low effort.
Stay tuned.
Cheers,
fijal
Having a fast implementation of Ruby written in Python would be very cool. :-p
Super cool!
Are you going to add SIMD stuff to the i386 backend?
Which is the main backend at the moment? LLVM?
cheers,
It would be amazing to run SciPy on PyPy with the JIT when this will be ready.
I'm interested in the choice of x87 as well. My understanding was that Intel (at least) was keeping x87 floating point around because of binary applications but that for single element floating point the SSE single-element instructions were the preferred option on any processor which supports SSE. (Unfortunately since they've got such different styles of programming I can understand if it's just that "older chips have to be supported, and we've only got enough programming manpower for 1 implementation".)
x87 because it's simpler and better documented. Right now it would be ridiculously easy to reimplement it using SSE.
The main backend is the one for 386. We have no working LLVM JIT backend: although llvm advertises support for JIT compilation, what it really provides is a regular compiler packaged as a library that can be used at run-time. This is only suitable for some kinds of usages; for example, it couldn't be used to write a Java VM with good just-in-time optimizations (which need e.g. quick and lazy code generation and regeneration, polymorphic inline caches, etc.)
How could GIL be removed from PyPy?
By using fine-grained locking: locking every dictionary and list while it is used. This is what Jython does (or more precisely, what Jython asks the JVM to do for it). This certainly comes with a performance penalty, so it would only pay off if you actually have and can use multiple CPUs -- which is fine in PyPy: you would just translate different pypy-c's depending on the use case.
This would be a pain to implement in CPython, in particular because of refcounting. Even if the Py_INCREF and Py_DECREF macros were made thread-safe, all C-level APIs that manipulate borrowed references might have to be redesigned.
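The fine-grained locking idea can be sketched at the Python level as a container that carries its own lock. This is only a picture of the principle: in an interpreter without a GIL the locking would live inside the implementation of list and dict, and the names `LockedList`, `worker` and `run` are made up for this example.

```python
import threading


class LockedList:
    """A list-like container that takes its own lock around every operation."""

    def __init__(self):
        self._items = []
        self._lock = threading.Lock()

    def append(self, item):
        with self._lock:
            self._items.append(item)

    def __len__(self):
        with self._lock:
            return len(self._items)


def worker(shared, count):
    for i in range(count):
        shared.append(i)


def run(num_threads=4, count=1000):
    """Append from several threads and return the final length."""
    shared = LockedList()
    threads = [threading.Thread(target=worker, args=(shared, count))
               for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return len(shared)
```

Each operation pays for acquiring and releasing a lock even when only one thread is running, which is the performance penalty mentioned above.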
Pyprocessing may serve multi-core cpu needs for the time being, as it's an almost drop-in replacement for the threading module.
I think it uses ctypes, so it should work with pypy.
pyprocessing has its own problems (not that threads have no problems at all :)
1. Memory usage: you basically need n times more memory, where n is the number of processes.
2. You cannot pass arbitrary data between processes, just stuff that you can marshal/pickle, which is quite a big limitation.
3. On the other hand, multiple processes give you better control, although not via a drop-in replacement for threading.
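Point 2 is easy to check with the standard pickle module. The helper name `can_pickle` is invented for this sketch; the exact exception type raised for an unpicklable object varies between object kinds and Python versions, so several are caught.

```python
import pickle
import threading


def can_pickle(obj):
    """Return True if obj survives pickle.dumps(), False otherwise."""
    try:
        pickle.dumps(obj)
    except (pickle.PicklingError, TypeError, AttributeError):
        return False
    return True
```

Plain data such as `{"a": [1, 2]}` passes this check, while objects like a `threading.Lock()` or a lambda do not, so they cannot be sent to another process as they are.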
Cheers,
fijal
The live demos seem to be down... :(
Back online. Our test server is down as well, which makes it a bit hard to know stuff :(
In response to maciej, OSes that implement copy-on-write fork (Linux, but not Windows, unsure about Mac OS X), don't take n times more memory. Fine-grained locking and an OpenMP-like syntax would be potentially useful. Maybe you could get a student to prototype these for you. But I'm sure someone will find a way to parallelize Python eventually, or we'll all switch to some other language, as the number of cores goes to infinity.
In my previous comment, I was partly wrong: COW reduces memory usage, however, in CPython the refcounting will cause the interpreter to write to every area of memory, so the reduction may not be that significant. Also, IronPython supports fine-grained locks.
Would it be better to lock not the whole mutable object but just an element or a slice (for lists), and not to lock the object at all for read operations?
It's a common method used in DBMSs. A small and fast implementation in PyPy (if it's possible to create one) would be great =)
Is there any calendar date for removal of GIL? or is it just a wish. Secondly, what is your speed aim compared with Java?
Thanks...
Rushen