May 26
Benchmarking Is Hard
Peter Markholm is right, benchmarking is hard. In typical fashion, I’m going to respond to a blog post response with another blog post.
Peter Markhold took issue with the improvements I made to the benchmark of grep, first, and smart match. So I felt compelled to explain why those are improvements and should actually improve the accuracy of the results. While my choice of wording may have been strong, my apologies to the original author (I can’t believe I criticized Michael Schwern’s code!), I stand by the improvements. Keep in mind, my opinions are formed from my time benchmarking things for my graduate work and for my employer on a few occasions.
The most important thing to providing a benchmark is the only thing within the timing loop is the operation you wish to time. In this case, we didn’t want to time the operation of creating the random data (rand). We didn’t want to time the data conversion and construction (chr and .) either. Those operations are things that the system may or may not optimize; since we have little control over how the system will optimize the code, we have to assume it adds unwanted ops to the timing loop. This is why I moved the creation of the @array outside of the annonymous sub. I left the annonymous sub because I didn’t want to write my own timing loop (Benchmark.pm can’t be used otherwise).
Also, we want to avoid any memory caching optimization that the system may provide for us. While this is typically useful, we absolutely do not want the system to load the entire @array of one test into memory and reuse the preloaded memory in another test. That could give an unreliable advantage to one of the tests depending on the system’s memory optimizer and the order of the tests. The reason why it’s a good idea to give each test it’s own set of data is it then avoid data prefetching.
Lastly, the argument that we might find the $needle in the beginning of the haystack is off track. With a large enough haystack, the probability that we will find the $needle within an eyeshot is negligible. We could increase the size of the haystack to be something like 100 billion random elements or even 1 trillion elements. Sure, the sample haystack was not large enough, but the testing method was much more sound (but not perfect).
Having said that, I will agree with Peter. My results absolutely should not be taken serious. There are many more things I would do differently. I would probably write my own timing loop so I could make sure the only thing within it would be the operation I want to test. I would also repeat the test a significant enough times with a variety of randomly generated large arrays (statistically large enough). I would test with different data types, different data structures, and different versions of all software involved. I’d also make sure not to test the code in multi-user mode with anything as few applications vying for the CPU as possible (this can have misleading results).
At the end of the day, those results were just examples. The changes were improvements but did not make for a complete and irrefutable test.
tags: code, perl, response
No comments
No Comments
Leave a comment