Archive for the 'Project Bootstrap' Category
The Fallacy of Industrial Expectation
I am what you would call a “professional student”: I have a Bachelor of Science in Computer Science, I am finishing a Master of Science in Computer Science with a focus in Intelligent Systems, I am contemplating a PhD, and I have been a software engineer for 6 years now.
Recently, Joel Spolsky published yet another article about how he feels the universities of the world are churning out students incapable of doing the daily duties of software development. I’ve read other scathing articles about academia. I’ve even responded to many comments similar to “you have a degree but don’t know how to use ToolX or program in LanguageY.” These criticisms always irritate me (and strike me as originating from someone who begrudges those with degrees), so I want to set the record straight about academia. There are 2 simple points I want to get across:
- The university’s primary concern is to teach you core knowledge and how to obtain new knowledge in any field.
- Computer Science is a division of Applied Mathematics.
It’s that simple. At no point is it the university’s responsibility to teach students an arbitrary tool or language, even one the industry consistently favors. I read this really great response to his article that echoes many of my complaints about this misconception.
Joel often comments that universities are trying to teach a particular language because it’s what the industry does or because MIT does it. That is wholly incorrect. The only reason a university favors a particular language is so that the professors can focus on teaching and grading in just one language, as that greatly simplifies their job. Java or Python is chosen because ideas can be expressed in them simply and straightforwardly. The point isn’t to teach a language but to teach an idea expressed in a language.
If my undergraduate university had taught me specifically how to use CVS, that skill would essentially be wasted. Instead, they teach how version control systems work so that I may either implement one or just use one in my day-to-day job. Which sounds like a better idea in the long run?
Now keep in mind that CS is really just a division of Applied Mathematics. If you haven’t come to understand that, then you do not truly understand the field. In fact, the original “computers” were humans who computed mathematical equations.
Sure, most undergraduate assignments seem simple in comparison, but it’s because they don’t want to teach the peripheral tasks. Those tasks, such as testing, working in teams, and the latest “agile” techniques, are unrelated to the core understanding and vary widely within the industry. By understanding the core concepts, everything else is an extension of your existing knowledge.
If my university had taught me how to use FogBugz or how to write Perl TAP tests, I would have been looking for another university. My graduate school has yet to require that I use a particular language or tool and has yet to teach a specific language or tool. In the long run, that makes me a more adaptable developer and far more valuable to my employer.
Thesis In Frustration
So this semester I have been investigating and working on my thesis. Right now, my focus is in Statistical Natural Language Processing. I don’t want to discuss the specifics of the research just yet, but it has the potential of completely up-ending the entire search industry.
I have been investigating how to build a large corpus from the web. My advisor favors using Google directly since it already exists and provides its search for free.
The first thing I did was investigate the Google SOAP API, only to find out that they deprecated it when they introduced the AJAX API. The new API only allows for about 60 results with no paging. Then I looked into the REST::Google API, but that only returns 10 results. Neither of those options seems feasible. I checked Yahoo’s Yahoo::Search interface and it only seemed to return 10 results (paging, if possible, was not obvious). I could write a direct scraper, but that would take a good deal of effort and I am not sure it would be worth it.
Then I even started looking at writing my own spider using WWW::Robot. This is a fairly complex module that does a ton of grunt work for you. The downside is that it behaves and follows the robots.txt protocol; that’s a problem for someone who wants to scrape everything with no regard for such a protocol.
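To make concrete what “behaving” means here, below is a minimal sketch of the check a polite spider performs before fetching a page. It is not WWW::Robot's API; it uses Python's standard `urllib.robotparser` purely for illustration, and the robots.txt content, the `thesis-spider` agent name, and the example.com URLs are all made up.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, parsed from a string so no network access is needed.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt):
    """Parse robots.txt text into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def allowed(parser, url, agent="thesis-spider"):
    """Return True if the (hypothetical) agent may fetch the URL."""
    return parser.can_fetch(agent, url)


parser = build_parser(ROBOTS_TXT)
for url in ("http://example.com/index.html",
            "http://example.com/private/data.html"):
    print(url, "->", "fetch" if allowed(parser, url) else "skip")
```

A spider that honors the protocol simply skips every URL for which the check fails, which is exactly the behavior that gets in the way of scraping everything.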
After spending maybe 20-30 hours flipping over this in the last 6 weeks, I finally made the effort to meet with my advisor. Since he is no longer answering email or his phone, I met him after his late class and talked it over with him while he ate dinner in the campus restaurant. We talked and waffled back and forth about our approach. In the end, we decided to investigate Lucene’s capabilities.
Frustrated and lost, I went about my week until I talked with a PhD student who is also advised by my advisor. Her patience with our advisor has been steadily declining. She missed a publication deadline because he failed to review a paper of hers. She also divulged that she intends to switch advisors because she is not making progress. I have been contemplating this myself, so it was good to hear that I am not the only one at my wits’ end.
I am not making progress and I am not willing to sacrifice my graduation. If I change advisors, hopefully I will find an advisor who provides much more support and direction yet gives me the option to continue developing in Perl. One of the professors I want to speak with runs a programming language lab.
Maybe I can merge my interest in Perl 6 with my thesis!
Falling Behind
I dropped off the Perl Ironman blogging challenge again. This time it wasn’t due to a date miscalculation; I took a midterm exam on Thursday. My current class has taken up most of my time in the past 6 weeks. There haven’t been any chances for coding just yet. I did want to talk about something I have been looking into lately for my thesis.
First, I want to post a lazyweb question: has anyone worked with REST::Google? My first question is whether anyone knows how to advance the page cursor with this module. I tried reading the code itself, but it uses Class::Accessor and Class::Data. I’m a bit unfamiliar with those modules (are they still popular?), and it looks like the cursor is read-only. I don’t really see the use for the module if it returns 10 results and cannot paginate.
So from that question, I took the sample and tried playing with it. This is just experimental to exercise the module’s capabilities. This is fairly boring, so I want to see if I can get some code working that can paginate through Google results using their REST API (if that’s even possible).
The author of this module probably feels very clever; they hid subpackages from CPAN by putting the package name on a separate line from the package keyword. They also used __PACKAGE__->mk_ro_accessors, which looks like it generates attribute accessors at runtime. I’m guessing CPAN cannot index that either. What’s the point of uploading your code to a public repository if you take measures to hide it from the repository?
Anyways, I’m soliciting ideas for paginating Google’s REST API results. Note: the SOAP API has been terminated, so that avenue is closed.
Semester Renewal
Just after I had been through a bit of a posting slump due to some fading tuits, it would seem as if they have magically returned. This week brings the start of the Fall 2009 semester and I have seemingly sprung back to life. With the new semester come new opportunities to use Perl!
First, I wanted to mention the one bit of news that made my summer, maybe even the last 2 years, seem as if I haven’t been running in circles. I noticed that my new department head at the university posted the final new rules regarding the qualification exams. The change is that there is a list of approved courses that will be offering the QE, passing any 3 is sufficient, and there is still no marginal grade. The last exam I took, Data / Text Mining in Bioinformatics, was not listed in the blessed course list. Turns out, the department would still accept the score, so I now lack only 1 remaining pass to be an official “PhD candidate”.
This semester I am taking the Design & Analysis of Computer Algorithms course. The downside is that this course tends to be theory-centric, so I won’t have many chances to flex my Perl muscles. There are tons of modules on CPAN, though, that might help with understanding the basics. There are plenty of graph, tree, and dynamic programming solutions available.
Interestingly enough, while sifting through those modules, I discovered a module of personal interest. I stumbled across the Algorithm::Viterbi module. I have studied Markov models, Markov chains, and Hidden Markov models a bunch in the last 2 years. One algorithm that keeps showing up is the Viterbi algorithm. I’ll leave it as an exercise to the reader as to how this algorithm is used, but I will point out that the Wikipedia page has Python code. Ironically, “Python’s answer to CPAN” isn’t quite all it’s cracked up to be; it lacks any packages pertaining to “viterbi” and has no generic Markov package.
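For readers who have not met the algorithm, here is a minimal sketch of it in Python. It is neither Algorithm::Viterbi's interface nor the Wikipedia listing, just a plain dynamic-programming version; the toy Healthy/Fever model and all of its probabilities are illustrative assumptions. Given a sequence of observations, it returns the most probable sequence of hidden states and that path's probability.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return (most likely hidden state path, probability of that path)."""
    # best[t][s] = (probability of the best path ending in state s at time t,
    #               predecessor state on that path)
    best = [{s: (start_p[s] * emit_p[s][observations[0]], None) for s in states}]
    for obs in observations[1:]:
        prev = best[-1]
        column = {}
        for s in states:
            # Pick the predecessor that maximizes the path probability into s.
            prob, pred = max((prev[p][0] * trans_p[p][s], p) for p in states)
            column[s] = (prob * emit_p[s][obs], pred)
        best.append(column)

    # Backtrack from the most probable final state.
    last = max(states, key=lambda s: best[-1][s][0])
    prob = best[-1][last][0]
    path = [last]
    for column in reversed(best[1:]):
        path.append(column[path[-1]][1])
    path.reverse()
    return path, prob


# Toy model (assumed values, for illustration only).
states = ("Healthy", "Fever")
start_p = {"Healthy": 0.6, "Fever": 0.4}
trans_p = {
    "Healthy": {"Healthy": 0.7, "Fever": 0.3},
    "Fever": {"Healthy": 0.4, "Fever": 0.6},
}
emit_p = {
    "Healthy": {"normal": 0.5, "cold": 0.4, "dizzy": 0.1},
    "Fever": {"normal": 0.1, "cold": 0.3, "dizzy": 0.6},
}

path, prob = viterbi(("normal", "cold", "dizzy"), states, start_p, trans_p, emit_p)
print(path, prob)
```

For long observation sequences the products underflow, so a practical version would typically sum log-probabilities instead of multiplying raw ones.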
Perl: Automatically tested, student approved.
Chip’s Core Hacker Presentation
At YAPC::NA, Chip Salzenberg held a brief, last-minute talk that really started making me think. Well, to be honest, the thought process started at the Parrot Workshop but really took off in this talk.
I’ve always been fascinated with programming languages and compilers. These things had always struck me as not academically challenging and basically a solved engineering problem. I never really did much in the area except for an undergraduate course where I wrote a compiler for TinyC, written in Perl, with a MIPS assembler code generator backend. It was a study in recursive descent parsers, so it lacked any real semantic capabilities, such as symbol tables, a garbage collector, or ASTs.
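For anyone who has not written one, a recursive descent parser is just one function per grammar rule, each calling the others. The sketch below is not the TinyC compiler; it is a small Python illustration for integer arithmetic with an assumed grammar (expr, term, factor), and like that project it builds no AST, evaluating directly as it parses.

```python
import re

# A token is an integer or one of the operator/parenthesis characters.
TOKEN = re.compile(r"\s*(\d+|[-+*/()])")


def tokenize(text):
    tokens, pos = [], 0
    text = text.rstrip()
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if not match:
            raise SyntaxError(f"unexpected character at position {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


class Parser:
    """Assumed grammar:
    expr   := term (('+' | '-') term)*
    term   := factor (('*' | '/') factor)*
    factor := NUMBER | '(' expr ')'
    """

    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def advance(self):
        token = self.peek()
        self.pos += 1
        return token

    def expr(self):
        value = self.term()
        while self.peek() in ("+", "-"):
            op = self.advance()
            rhs = self.term()
            value = value + rhs if op == "+" else value - rhs
        return value

    def term(self):
        value = self.factor()
        while self.peek() in ("*", "/"):
            op = self.advance()
            rhs = self.factor()
            # '/' is integer (floor) division in this sketch.
            value = value * rhs if op == "*" else value // rhs
        return value

    def factor(self):
        token = self.advance()
        if token == "(":
            value = self.expr()
            if self.advance() != ")":
                raise SyntaxError("expected ')'")
            return value
        if token is not None and token.isdigit():
            return int(token)
        raise SyntaxError(f"unexpected token {token!r}")


def evaluate(text):
    parser = Parser(tokenize(text))
    value = parser.expr()
    if parser.peek() is not None:
        raise SyntaxError("trailing input")
    return value


print(evaluate("2 + 3 * (4 - 1)"))
```

Operator precedence falls out of the call structure: expr calls term, and term calls factor, so multiplication binds tighter than addition without any precedence table.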
So anyways, back to how this relates to Chip’s laid-back talk. Basically, he encouraged people to become core Perl hackers. “Yes, things are bad, but they’re not that bad.” The talk only lasted 20 minutes, but my brain kept spinning for the rest of the day. I tried having a few conversations, but I failed to sustain anything for more than a few minutes. The talk wasn’t overly special. It wasn’t groundbreaking or funny; it was clearly an impromptu presentation with slides obviously created quickly; it lacked substantial content and it was not thought provoking. It was the perfect talk at the perfect time because my brain was in the perfect state of mind to think clearly and creatively.
So I have since turned my attention away from my thesis for the summer and towards compilers and programming languages. I have started filling my research notebook with all sorts of ideas in the chance that I spark a true flash of genius. My advisor is likely to be annoyed that I decided to change directions radically but he went overseas for the summer and has stopped responding to email.
Thank you Chip, you may very well have ignited my imagination and passion to create something truly worthwhile.
PAUSE For A Moment
I finally took the plunge and requested a PAUSE user account. PAUSE, for those uninformed, is the Perl Authors Upload Server. Through this, users are able to upload modules, scripts, and source code to CPAN.
I’ve been writing Perl programs for maybe 7 years now but have never found anything I wrote to be useful to anyone other than myself. Then I started reading Linus Torvalds’ book “Just for Fun: The Story of an Accidental Revolutionary”. I’m not done reading it yet, but I will post a review when I finish.
In the 10 years or so of using Linux, I’ve never really contributed much. I wrote a quick program over on SourceForge but found I never wanted to use it myself. I’ve contributed a handful of patches to various projects. I even wrote the arbitrary radix parsing code in Rakudo. Yet I’ve never contributed any significant amount of code. I’ve always felt guilty about that, like a leech who used other people’s work for my own selfish ends.
Well, that book changed everything.
I’ve got a backlog of a ton of little programs that each solve some domain-specific problem. Most have been for school; some are completely boring while some produce highly interesting results. It’s time I start contributing my code back to the world.
I’m not sure what I’ll contribute first. Most of them need to be cleaned up from their current state of “get it working” to publishable code. I’m going to start working on that over the summer. There are a few projects that I am working on, one of them being my thesis, which can all ultimately find their way onto CPAN.
Company Paid
I recently had the pleasure of attending a Human Language Technology showcase that lasted all day last Friday. I managed to convince my employer to foot the bill for the day, and here’s how I did it. Note that any identifying company information has been redacted.
The University of Texas at Dallas is hosting a showcase series focusing on Human Language Technologies on Friday, 23 January 2009. I was invited to attend as either a research student to my advisor or as an REDACTED employee / representative. Since my graduate work will likely focus on these topics, I have an interest in attending.
The showcase series is a set of panels and talks about how Human Language Technologies apply to industry and its applications. There will be representatives from other companies, such as Raytheon, who will be presenting with other professors.
Our customer might have an interest in products that can analyze human languages and provide capabilities for interpretation. If REDACTED could provide such products integrated into our existing products, we might be able to offer greater power and capabilities that our customer might have an interest in.
My request is that REDACTED fund my attendance at the showcase since it is held midday. If the entire time cannot be supported, I am willing to split the time between personal time off and REDACTED funding. I have attached the official flier that contains the panel topics and the schedule as well as a brief description of Human Language Technologies.
I would also like to give a brief talk about the discussions and how these fields could benefit REDACTED. I would prepare for this on my own time and am willing to give this talk to any interested party.
Thank you,
REDACTED
Turns out, my company paid for the entire day as offsite training. I am currently preparing the summary and the presentation for them. I hope it generates some interest in moving into areas we have yet to explore. This would be a great way to make a name for myself not only in my company but also in the industry.
I won’t be able to post anything I create for my employer for various reasons though.