Author Archives: Wim

Sniper: a new multi-core simulator

Sniper is a next generation multi-threaded, high-speed and accurate x86-64 simulator. This microarchitectural simulator is based on the interval core model and the Graphite simulation infrastructure, allowing for fast and accurate simulation and for trading off simulation speed for accuracy to allow a range of flexible simulation options when exploring different micro-architectures. Using this methodology, we are able to achieve good accuracy against hardware for 16-thread applications.

The Sniper simulator allows one to perform timing simulations for multi-threaded, shared-memory applications with 10s to 100+ cores, at a high speed when compared to existing simulators. The main feature of the simulator is its core model which is based on the interval core model, a fast mechanistic core model. The interval model allows for faster simulations than typical cycle-accurate simulators by jumping between miss events because of long-latency operations. On recent machines, we see speeds of up to 2 MIPS for SPLASH-2 benchmarks, and almost 3 MIPS for SpecOMP benchmarks.

This simulator, and the interval core model, is useful for uncore and system-level studies that require more detail than the typical one-IPC models. As an added benefit, the interval core model allows the generation of CPI stacks, which shows the number of cycles lost due to different characteristics of the system, like the cache hierarchy or branch predictor, to be easily understood.

Sniper is available for download at

Revenge of the low-cost supercomputer – part 2

Following up on my TOP500 list by interconnect post, here’s a update with the latest, June 2010 TOP500 list.

First, for comparison, here’s the November 2009 version at the same scale:

And this is how the list looks now:

For one, given the number of recent announcements around it, I was expecting for 10G-Ethernet to gain some more adoption by now. Yet (assuming those details in the TOP500 list – and my parsing of it – are correct), only two new systems use 10G-Eth while those that were in last November’s list have now fallen out.

But something much more interesting seems to be happening at the lower right corner of the graph (look at the three blue (Inifiniband) dots nearest to the legend, interestingly, all three systems are Chinese). The Tianhe-1 hybrid Intel Xeon + ATI Radeon cluster (now at #7) got some company from two more GPGPU clusters, Mole-8.5 at #19 and even the new #2, Nebulae. They are characterized by a rather low efficiency – especially for this part of the ranking. Note that no-one else in the right half fools around with mere Gigabit Ethernet or otherwise has an efficiency that is lower than 70%. Yet the number two of this list only manages a 43% efficiency, and needs 2984 GFLOPS of raw computing power (28% more than the #1 system, Jaguar) to get at a LINPACK score of 28% less than Jaguar’s.

The explanation for this lies in two numbers that are not shown in this graph: power and cost. Just like the commodity-based clusters started to take over from the hard-core custom-built supercomputers some 10 years ago, the GPGPU-based system may very well be on its way to take over the charts. They follow the same basic recipe as those Beowulf-inspired clusters: a not-so-great efficiency, which is cured with loads of cheap, low-power processing power.

The efficiency of the cluster improved drastically over time, using better interconnect such as Infiniband. The cluster idea is now so prevalent that over 80% of the systems in the Top500 today should be categorized as clusters. The next question is how GPGPU will evolve into something that can combine its advantages of low cost and low power with increased efficiency. Nvidia has some clear ideas about HPC being the future for their products (although their first step, Fermi, got executed rather strangely…). ATI/AMD are also hard at work with FireStream and OpenCL. And coming from the other side of the arena, we have the general purpose processors moving towards simpler, but many more cores per chip, an idea embodied in the form of (among many others) Tilera’s TILE family or Intel’s SCC.

Interesting times again loom ahead. Let the games, eh computations, begin!

Historical currency converter web service

Looking for an excuse to try out Google AppEngine, and encouraged by someone on StackOverflow looking for a free web service to convert between currencies at historical dates, I built the Historical currency converter web service. Using a very simple RESTfull API, you can convert between all currencies on the ECB’s list, using exchange rates that date back to January 1999.

Continue reading

Questioner or answerer?

Yesterday on StackOverflow, I came across one of those users that kept asking questions, but didn’t really seem to understand much of the responses. Looking at his profile, it turned out he had asked over a hundred questions, but contributed less than ten answers. I won’t be tempted to start about his capabilities of actually answering any SO questions (although his understanding of other’s answers to his own questions, except when he was able to copy-paste someone’s source code, also didn’t seem to be that great), but it did get me thinking about what a ‘common’ ratio of questions versus answers would be for other SO users (personally, I’m at 1/85 right now). Of course, that triggered my data-analysis and graphing gene…

Continue reading

StackOverflow user diversity

I’ve been wondering what the diversity of knowledge of StackOverflow users would be like. It seemed like an interesting research idea to see how many people have responded only to questions in a very narrow field, and how many others have broader knowledge and can contribute useful answers in more diverse fields. Apparently, there is even supposed to be a badge for that (the Generalist badge), but it didn’t get implemented yet.

It’s easy to do this using tags: some sort of clustering should be applied according to how often each pair of tags shows up at the same question (a user that knows both ASP and shouldn’t be considered a ‘diverse’ person, so this should be factored out first), next we can count in how many different clusters that this user has contributed a good answer.

Continue reading

TOP500 list by interconnect

While attending a Hot Interconnects talk on supercomputing, I got the following idea. The TOP500 site provides graphs of the number of systems and total performance per interconnect family, which shows an approximate measure of the popularity of the different interconnects. But how do they affect the performance of an individual system? Clearly, a high-performance interconnect should result in higher efficiency than a commodity one. But by how much? And which systems would use what type of interconnect?

Continue reading

Tokina 11-16mm f/2.8 wide-angle lens

… or how to make a small space look huge.

I couldn’t resist, and bought another lens. This time a real wide-angle lens, the Tokina AF 11-16mm f/2.8 AT-X Pro DX. Yes, it’s a zoom lens, and on paper the range fitted nicely to my Tamron 17-50mm f/2.8. But the difference between 11 and 16mm turned out to be so tiny (and the image quality so good that cropping an 11mm image down to 16mm would be more than adequate), that I might as well have gotten a prime 11mm one. But still, a very nice lens this is!

Continue reading