Ampersand
Today, only fRew and I ended up meeting for the September Dallas.p6m meeting. Turns out several people forgot that today was in fact the Tuesday after Labor Day and not Monday. We mostly just sat and talked about miscellaneous software things. We discussed GPS, my Algorithms course, Parrot, PGE, and styling. Styling was the last thing that we discussed and the one topic that got semi-heated.
I discussed my reasoning for my styling quirks, which fRew insisted he would replace in a heartbeat. I’ve mostly been honing my preferences by finding bugs and noticing which styles seem to lead towards common mistakes. For example, I always place the if statement’s opening parenthesis immediately after the ‘f’ because you cannot have one without the other (well, you can have the boolean expression alone but it rarely makes sense unless it’s a return value). I apply similar reasoning to why I put my braces at the end of the if line with a space between the ‘)’ and the ‘{’. I do this because the block started by ‘{’ can exist without the if clause. I always keep the ‘{’ on the same line so that attempts to comment out the if clause will fail to compile, as I’ve found and fixed too many bugs caused by exactly that kind of laziness.
Now that may seem kinda weird, but it’s 1) a mental separation technique and 2) an attempt to reduce the number of standalone nested blocks (which can do odd things like cause variable scope issues).
Then we started discussing using an ampersand, ‘&’, to begin function calls. My reasoning is that I more often than not prefer to be as explicit as possible, and the ‘&’ lets me do that. I failed to recall at the time the example that led me to prefer the ampersand, but I eventually found it. Basically, functions should look like function calls and not keywords, macros, or other environment lexicals. &foo($arg1, $arg2); looks a bit hairy and dated (generally a perl4 way of doing things), but it’s clear from the first character what is about to happen. My brain needs only to parse the first character to read the code with the right mindset. I am calling a user defined function (not a built-in), named ‘foo’, and passing 2 arguments. That is clear, readable, and will likely work for the foreseeable future; if not, it’s still easy to find and correct.
On the other hand, foo; or foo(); (under strict) is not necessarily clear. The first example is basically a bareword and could be any number of things. It could be a symbol, a package, a string, or a function call. If it is a call, working out which arguments it receives requires more investigation (the parenthesis-free &foo; form, for instance, passes along the caller’s current @_). The second one looks like a subroutine call but I have to parse 5 characters and then grep around for a sub named foo within the current namespace (was it loaded elsewhere and exported to my namespace?). While both of these are more compact and concise, they also both require more work to figure out exactly what is happening.
Also, foo($arg1, $arg2); is clearer, but not until I’ve read at least 4 characters do I start to think it might be a function call. This does not parse and skim nearly as quickly, at least for me.
All of this skimmable code talk (note: I don’t agree with Schwern, end-of-scope comments usually clutter code more than they help) may sound frivolous to those readers who deal with thousands of lines of code. It’s not something you can truly appreciate until you maintain code that weighs in with at least 9 digits (executable code only). I personally manage 250,000 lines and I am responsible for a product that is about 2 million lines (all 30 of our branches are about 2 million lines each).
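For readers who want to see the call forms side by side, here is a minimal Perl 5 sketch. The names show_args and caller_sub are made up for illustration. It only demonstrates how each form passes arguments; it does not settle the style question either way.

```perl
use strict;
use warnings;

# Hypothetical helper: report how many arguments were received.
sub show_args { return scalar(@_) . " arg(s) [@_]" }

# Hypothetical caller that exercises each call form.
sub caller_sub {
    my @out;
    push @out, &show_args(1, 2);   # sigil plus parens: explicit argument list
    push @out, show_args(1, 2);    # no sigil: same arguments
    push @out, show_args();        # empty parens: empty argument list
    push @out, &show_args;         # sigil, no parens: reuses the caller's @_
    return @out;
}

print "$_\n" for caller_sub('x', 'y', 'z');
# 2 arg(s) [1 2]
# 2 arg(s) [1 2]
# 0 arg(s) []
# 3 arg(s) [x y z]
```

The last form is the one to watch: with the sigil but without parentheses, the callee sees whatever @_ the caller currently holds.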
In the end, I stick by my preference for ampersand function calls unless someone else can point out a better reason to ditch them.
Epoxy – The Glue That Holds Systems Together
I started a Perl 6 project some time ago from a few discussions during a Dallas.p6m (Perl 6 Mongers) meeting. The idea was to create a complete packaging system in Perl 6 for Perl 6 with Perl 6. That sounds confusing, so let me explain the project some more. In retrospect, this is a long post that may bore all but the most ardent reader.
First, you can find the epoxy project on github in my account and the epoxy-resin database in the perl6 account (since pmichaud gives out commit access liberally). Epoxy is the main system with the epoxy-resin project as the package database.
The idea is similar to Gentoo’s ebuild system. Instead of writing clever shell scripts and executing them in crafty ways, I decided to take another route. Perl developers clearly like doing 1 thing: writing perl code. What better way to build and release your perl code than by writing some more perl code? The key is that the most difficult and most common tasks need to be automated while still allowing for a huge range of flexibility.
I chose to implement the project as a tool that uses ebuild-like files, called resin files, that provide all of the functionality necessary for a user or developer to find, build, install, and repackage a project. The goal was to fully replace Makefiles, Configure.pl scripts, ExtUtils::MakeMaker, Module::Build, Module::Install, and most importantly, CPAN and PAUSE. That sounds like a lofty goal, but it’s working out quite nicely so far.
As a sidenote, I have seen Mark Overmeer’s CPAN6 and PAUSE6. I have seen masak’s proto and dcarrera’s ppm. I have seen the PAR project and Software::Packager. I have written ebuilds and RPM spec files. I have worked with JAR packages and WAR packages. I have created JNLP files. I have worked with (but not necessarily created) packages for basically every mainstream Linux distribution. I decided that only a few of these solutions are interesting. I also concluded that CPAN and PAUSE as we know them need to be recycled. They’re old enough and Perl 6 is new enough that it’s time to reinvent the wheel. Rather than creating a crufty 1990’s wheel all over, let’s create a modern wheel capable of handling modern technology (for example, CPAN authors have migrated to github in droves).
A resin file is similar to the ebuild file. Rather than giving it a unique extension or artsy syntax, I decided to make the resin files classes. To provide a resin build and packager for your project, you inherit from the base resin file and override the functionality you need. The idea is you will have to implement the smallest subset of functionality required for epoxy to fetch, build, and package your project. That is, for a “hello world” project, you only need to subclass Epoxy::Resin and provide a BUILD submethod that sets your metadata.
The base class, Epoxy::Resin, provides base functions, known as targets, to do things such as fetch, build, dist (repack), install, test, clean, upgrade, and remove. Each module may either use the basic functionality or provide some custom actions for each target. A resin file declares the module’s metadata by setting public accessor values, such as author, website, and license.
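The subclass-and-override idea can be sketched roughly as follows. To stay consistent with the other code on this blog, the sketch is written in Perl 5 rather than Perl 6, and every name in it (Demo::Resin, Demo::Resin::Hello, run, the metadata values) is hypothetical. It is not the actual Epoxy::Resin API, only an illustration of default targets plus a single override.

```perl
use strict;
use warnings;

package Demo::Resin;    # hypothetical base class, not the real Epoxy::Resin

# Names that may be dispatched as targets.
my %IS_TARGET = map { $_ => 1 } qw(fetch build test install clean);

sub new {
    my ($class, %meta) = @_;
    return bless { %meta }, $class;
}

# Metadata accessors.
sub author  { $_[0]{author} }
sub license { $_[0]{license} }

# Default targets: a trivial project inherits all of these unchanged.
sub fetch   { "default fetch" }
sub build   { "default build" }
sub test    { "default test" }
sub install { "default install" }
sub clean   { "default clean" }

# Dispatch a target by name, the way a shell command might.
sub run {
    my ($self, $target) = @_;
    die "unknown target '$target'\n" unless $IS_TARGET{$target};
    return $self->$target();
}

package Demo::Resin::Hello;    # hypothetical resin for a "hello world" project
our @ISA = ('Demo::Resin');

# Set the metadata; everything else comes from the base class.
sub new {
    my ($class) = @_;
    return $class->SUPER::new(author => 'A. Hacker', license => 'Artistic-2.0');
}

# Override only the target that needs custom behaviour.
sub build { "custom build for hello" }

package main;

my $resin = Demo::Resin::Hello->new;
print "$_: ", $resin->run($_), "\n" for qw(fetch build install);
```

Only build is overridden here, so fetch and install fall through to the base class defaults.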
There are several resin files already written, though they have to be constantly maintained as I improve the functionality of the shell and the base class. Those resin files are for masak’s projects right now since his are the most visible. I intend on writing more resin files once the current ones are smaller and some of their functionality resides in the base class.
There’s also a shell that does this work. I used Tene’s sweet dispatcher to handle the commands in the most sensible and shortest manner I could come up with. Currently, it only dispatches the shell commands out directly as targets, but I intend on grouping the targets into meta-commands so that common tasks are easier. Also, only a few targets are currently supported since I have to also write the targets for each of the resin files I currently maintain. This is a time management juggling act, so I decided what I have will suffice for now.
So once you look at it (or try it out for that matter), you’ll realize there is much to do. I won’t lie, it’s very early work. There is no test suite (I haven’t figured out the best way to integrate with rakudo’s Test.pm). There is duplicated code to save time solving a harder problem (such as the shell dispatch handlers and the build targets). Most of the targets are non-functional, such as upgrade, dist, and remove. It doesn’t currently attempt to follow the standard or even remotely care what the standard says on the matter. There is a lack of error handling, such as use-ing modules that don’t exist.
What I’m trying to say is that I understand how it is incomplete. Everything that it lacks is because I have not determined the best way to solve the problem yet. In some cases, I had to even work around known problems or a complete lack of an eco-system. The Epoxy::Fringe and Git modules are proof of that. I have the tuits and development will inch forward.
I haven’t decided on an issue tracker for the project yet. I tried to use Lighthouse but it seems unintuitive. In order to complete a milestone, you have to create tickets and mark them as complete. This is really just good for cyclic or maintenance projects but not for off-the-cuff new development. I’m currently thinking about github’s newly added Issue page, which looks like it will suffice for now.
I asked masak, a prominent member of the Perl 6 community, to review my code. He had mostly positive things to say. He pointed out that he didn’t like the name of the module that builds Epoxy itself: Epoxy::Build. He told me about IO.prompt and how BUILD submethods are now supported. He also pointed out my lack of attention to the metadata attributes. These are all things on my TODO list which will be addressed eventually. He was surprised to find that I was replacing almost all of the Makefile, Makefile.in and Configure scripts in his projects.
This all started out as an exercise to learn Perl 6 better. I wanted to know the syntax and to start thinking of how things work. The best way to learn is to act. I am having fun working on this every other day or so. It’s a nice mental exercise that keeps me busy over the summer.
I have grand plans for this. I will take my time and focus on the code. I will not be distracted by a charged discussion without any working code. Concepts and ideals are great, but sometimes, code speaks for itself. I won’t argue about the future of CPAN or which way is the best to package a product. I won’t partake in the arguments because everyone seems to have an opinion but no solution. Those are arguments no one can win. Until you know that, you are useless.
I will, however, take feedback and constructive criticism. I don’t want to fan any flame wars or discuss what anyone thinks is the best idea for the future of Perl or Perl 6. This is my free time and the last thing I’m interested in is pointless heated discussions. I am always open to feedback, reviews, and assistance.
Gentoo Herd Abandons Perl
Wow, that title sounds pretty gloomy and not entirely meant as a news header. It’s true but it was never declared.
Let’s start out by looking at b.g.o. bug 206455. Check out the header information on that bug. Let me reproduce the interesting bits:
Perl 5.10.0 was released about a month ago. I attach modified ebuilds and
patches that I used to install it successfully (?) on my system
Reproducible: Always
Opened: 2008-01-17 19:39 0000
Current Status: New
In case you didn’t catch that, check out the bolded text one more time. That’s right, it’s been about 18 months since perl-5.10 was released and Gentoo still does not support it. How could a source based distribution that used to pride itself on bleeding edge support possibly fall so far behind?
Simple: the herd maintainers, both of them, have basically abandoned Gentoo.
This is interesting and saddening for many reasons. I’m a long time user and supporter of Gentoo and it pains me to see it fall. In my opinion, Gentoo was the only distribution to get package management correct. I loved being able to test bleeding edge software before everyone else, including Debian and RedHat.
This situation also shows 2 problems with open source projects that you would not typically expect to exist. First, maintaining distribution supplied versions of Perl and CPAN modules is a loser’s game. It’s nearly impossible to update all of those ebuilds as fast as the developers of the modules themselves. g-cpan was a terrible project that never really worked well and no one wants to take over (as you can see from some of the recent comments). What I’m taking from this is that CPAN authors themselves are the most likely candidates to keep their modules building and installing since they’re doing something very similar already.
It also points out that with any project, there needs to be some level of satisfaction. It’s apparent from the herd’s (not so) recent commits that they lacked the desire to continue. This could stem from the historically poisonous Gentoo developer community, the difficulty in maintaining the ebuilds, or real life interference.
What can we learn from this? How can we possibly improve the situation? These are difficult questions and it’s heartening to see that concerned parties are finally starting to ask them publicly. I think the first thing that needs to happen is to stop asking “where is perl-5.10”. It won’t be released on Gentoo by the herd. I’ve had to live with that for 18 months and now everyone else needs to as well. What we can do is try to improve the tools we have available, fix their problems, and write new tools to fill in the gaps. We need tools to help distributors with the ever-growing nature of CPAN. No one knows what needs to happen for that yet. The time for asking “where is it?” is over; now is the time to roll up our sleeves and get to work.
perl-5.10 on Gentoo should become a case study on how open source projects can succeed and fail. Gentoo itself is a case study with developer relations, but that’s a talk for another day.
Benchmarking Is Hard
Peter Markholm is right, benchmarking is hard. In typical fashion, I’m going to respond to a blog post response with another blog post.
Peter Markholm took issue with the improvements I made to the benchmark of grep, first, and smart match. So I felt compelled to explain why those are improvements and should actually improve the accuracy of the results. While my choice of wording may have been strong, and my apologies to the original author (I can’t believe I criticized Michael Schwern’s code!), I stand by the improvements. Keep in mind, my opinions are formed from my time benchmarking things for my graduate work and for my employer on a few occasions.
The most important thing when providing a benchmark is that the only thing within the timing loop is the operation you wish to time. In this case, we didn’t want to time the operation of creating the random data (rand). We didn’t want to time the data conversion and construction (chr and .) either. Those operations are things that the system may or may not optimize; since we have little control over how the system will optimize the code, we have to assume it adds unwanted ops to the timing loop. This is why I moved the creation of the @array outside of the anonymous sub. I left the anonymous sub because I didn’t want to write my own timing loop (Benchmark.pm can’t be used otherwise).
Also, we want to avoid any memory caching optimization that the system may provide for us. While this is typically useful, we absolutely do not want the system to load the entire @array of one test into memory and reuse the preloaded memory in another test. That could give an unreliable advantage to one of the tests depending on the system’s memory optimizer and the order of the tests. Giving each test its own set of data is a good idea because it avoids that kind of data prefetching.
Lastly, the argument that we might find the $needle in the beginning of the haystack is off track. With a large enough haystack, the probability that we will find the $needle within an eyeshot is negligible. We could increase the size of the haystack to be something like 100 billion random elements or even 1 trillion elements. Sure, the sample haystack was not large enough, but the testing method was much more sound (but not perfect).
Having said that, I will agree with Peter. My results absolutely should not be taken seriously. There are many more things I would do differently. I would probably write my own timing loop so I could make sure the only thing within it would be the operation I want to test. I would also repeat the test a statistically significant number of times with a variety of randomly generated large arrays. I would test with different data types, different data structures, and different versions of all software involved. I’d also make sure to run the tests outside of multi-user mode, with as few applications vying for the CPU as possible (contention can produce misleading results).
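As a rough illustration of what a hand-written timing loop could look like, here is a minimal sketch. It assumes Time::HiRes for wall-clock timing, and the helper name time_op is made up. Calling a code reference still adds sub-call overhead to every iteration, so this is a starting point rather than a rigorous harness.

```perl
use strict;
use warnings;
use Time::HiRes ();
use List::Util qw(first);

# Hypothetical helper: run $code $iterations times and return the
# elapsed wall-clock time in seconds.
sub time_op {
    my ($code, $iterations) = @_;
    my $start = Time::HiRes::time();
    $code->() for 1 .. $iterations;
    return Time::HiRes::time() - $start;
}

# All set-up happens here, outside anything that is timed.
my @haystack = map { chr(65 + int(rand(26))) . $_ } 1 .. 1000;
my $needle   = $haystack[-1];    # last element: a full scan for first()

my $elapsed = time_op(sub { my $hit = first { $_ eq $needle } @haystack }, 1_000);
printf "first: %.4f seconds for 1000 searches\n", $elapsed;
```

Repeating the measurement several times with freshly generated haystacks, and comparing the spread, would be the next step.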
At the end of the day, those results were just examples. The changes were improvements but did not make for a complete and irrefutable test.
Yet Another Splice Function
We’ve all used splice. It’s a great function that basically encapsulates push, shift, unshift, and pop to name a few. While it’s very useful, it does have its limitations. One limitation that has always bugged me is that it requires indices. Sometimes, you may not know these or want to know these indices ahead of time. That’s why I created another splice function.
This version, which I will officially call “yasplice”, only takes references to 2 arrays and returns an array. It removes all of the elements in the second array from the first array by value rather than index. Note this requires perl5.10 because of the extremely useful smartmatch operator (~~). This snippet also overrides the splice function so this code should be localized.
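use feature ':5.10';
use subs 'splice';    # let the sub below override the built-in splice

# Return the elements of @$aa whose values do not appear in @$bb.
sub splice {
    my ($aa, $bb) = @_;
    grep { my $one = $_; !grep { $_ ~~ $one } @$bb } @$aa;
}

my @a = (1,2,4,5,7,5,4,2,6,6,6,5,4,4,5,6,6,4,3,4,5);
my @b = (4,5,6);
my @c = splice(\@a, \@b);
say join ", ", @c;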
Prints:
1, 2, 7, 2, 3
Simple and useful. It might not be a good idea to use this on N sized arrays as the runtime would then be N².
While looking around for other interesting perl things, I found this use.perl.org blog post about smartmatch’s performance. That blog post illustrates a problem with his testing methodology. If I were to publish a paper with a gaping hole like that, it would never be taken seriously. In each test, he’s generating a random number and a random character and storing the result! That’s not part of the test. The $needles should each be generated outside the timing loop. This is adding tons of additional instructions that are not considered part of the test. These results are basically useless and should be redone.
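If the quadratic runtime matters, one alternative is a hash lookup instead of the nested grep. This is a swapped-in technique, not smartmatch: it compares elements by their string value, and the name remove_values is made up for this sketch.

```perl
use strict;
use warnings;

# Hypothetical helper: return the elements of @$aa whose string value
# does not appear in @$bb. Building the hash is one pass over @$bb and
# filtering is one pass over @$aa.
sub remove_values {
    my ($aa, $bb) = @_;
    my %drop = map { $_ => 1 } @$bb;
    return grep { !$drop{$_} } @$aa;
}

my @a = (1,2,4,5,7,5,4,2,6,6,6,5,4,4,5,6,6,4,3,4,5);
my @b = (4,5,6);
print join(", ", remove_values(\@a, \@b)), "\n";    # prints: 1, 2, 7, 2, 3
```

Because hash keys are strings, values such as 4 and "4.0" are treated as different here, which may or may not be what you want.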
So I did:
use Benchmark;
use 5.010;
use List::Util qw(first);
my $needle1 = chr(64+int(rand(26))).int(rand(1000)+1);
my $needle2 = chr(64+int(rand(26))).int(rand(1000)+1);
my $needle3 = chr(64+int(rand(26))).int(rand(1000)+1);
my $needle4 = chr(64+int(rand(26))).int(rand(1000)+1);
my @array1 = map { chr(64+int(rand(26)))."$_" } 1..1000;
my @array2 = map { chr(64+int(rand(26)))."$_" } 1..1000;
my @array3 = map { chr(64+int(rand(26)))."$_" } 1..1000;
my @array4 = map { chr(64+int(rand(26)))."$_" } 1..1000;
timethese(100_000, {
    'first' => sub {
        first { $_ eq $needle1 } @array1;
    },
    'grep BLOCK' => sub {
        grep { $_ eq $needle2 } @array2;
    },
    'grep EXPR' => sub {
        grep $_ eq $needle3, @array3;
    },
    '~~' => sub {
        $needle4 ~~ @array4;
    }
});
With the following results:
Benchmark: timing 100000 iterations of first, grep BLOCK, grep EXPR, ~~…
first: 20 wallclock secs (19.22 usr + 0.08 sys = 19.30 CPU) @ 5181.35/s (n=100000)
grep BLOCK: 15 wallclock secs (14.12 usr + 0.06 sys = 14.18 CPU) @ 7052.19/s (n=100000)
grep EXPR: 13 wallclock secs (12.96 usr + 0.06 sys = 13.02 CPU) @ 7680.49/s (n=100000)
~~: 5 wallclock secs ( 4.61 usr + 0.02 sys = 4.63 CPU) @ 21598.27/s (n=100000)
Those results are astonishing! Smart match is about 4 times faster than first, not twice as fast as the original blog poster found. Notice that I changed each test to look for a different $needle in a different @array. All of the randomly created data is generated outside the timing loop, so the only operations being timed are the call to the anonymous sub, the timing overhead itself, and the search function.
So I’m sure you could clearly write my yasplice function to search in a much quicker manner using smart match. I’ll leave that as an exercise to the reader as this is about all I want to write for today.