Thesis In Frustration

October 25th, 2009 | Category: 01100011, Grinds My Gears, Project Bootstrap

So this semester I have been investigating and working on my thesis. Right now, my focus is on Statistical Natural Language Processing. I don’t want to discuss the specifics of the research just yet, but it has the potential to completely upend the entire search industry.

I have been investigating how to build a large corpus from the web. My advisor favors using Google directly, since it already exists and offers its search for free.

The first thing I did was investigate the Google SOAP API, only to find out that they deprecated it when they introduced the AJAX API. The new API only allows for about 60 results with no paging. Then I looked into the REST::Google API, but that only returns 10 results. Neither of those options seems feasible. I checked Yahoo’s Yahoo::Search interface and it only seemed to return 10 results (paging, if possible, was not obvious). I could write a direct scraper, but that would take a good deal of effort and I am not sure it would be worth it.

Then I even started looking at writing my own spider using WWW::Robot. This is a fairly complex module that does a ton of grunt work for you. The downside is that it behaves and follows the robots.txt protocol; that’s a problem for someone who wants to scrape everything with no regard for such a protocol.
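To make the robots.txt constraint concrete, here is a minimal sketch of the check a well-behaved spider performs before fetching a page. It is written in Python with the standard library's urllib.robotparser rather than WWW::Robot, and the robots.txt contents, the "thesis-bot" user agent, and the example.com URLs are all made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: every crawler is asked to stay out of /private/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def allowed_urls(parser: RobotFileParser, user_agent: str, urls: list) -> list:
    """Keep only the URLs the given user agent is permitted to fetch."""
    return [url for url in urls if parser.can_fetch(user_agent, url)]


parser = build_parser(ROBOTS_TXT)
candidates = [
    "http://example.com/index.html",
    "http://example.com/private/notes.html",
]
print(allowed_urls(parser, "thesis-bot", candidates))
```

A spider that honors the protocol simply drops the second URL, which is exactly the coverage a scrape-everything corpus builder would lose.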

After spending maybe 20-30 hours flipping over this in the last 6 weeks, I finally made the effort to meet with my advisor. Since he is no longer answering email or his phone, I met him after his late class and talked it over with him while he ate dinner in the campus restaurant. We talked and waffled back and forth about our approach. In the end, we decided to investigate Lucene’s capabilities.

Frustrated and lost, I went about my week until I talked with a PhD student who is also being advised by my advisor. Her patience with our advisor has been steadily declining. She missed a publication deadline because he failed to review a paper of hers. She also divulged that she intends to switch advisors because she is not making progress. I have been contemplating this myself, so it was good to hear that I am not the only one at my wits’ end.

I am not making progress and I am not willing to sacrifice my graduation. If I change advisors, hopefully I will find an advisor that provides much more support and direction yet gives me the option to continue developing in perl. One of the professors I want to speak with runs a programming language lab.

Maybe I can merge my interest in Perl 6 with my thesis!


Falling Behind

October 04th, 2009 | Category: 01100011, Project Bootstrap

I dropped off the Perl Ironman blogging challenge again. This time it wasn’t due to a date miscalculation; I took a midterm exam on Thursday. My current class has taken up most of my time in the past 6 weeks. There haven’t been any chances for coding just yet. I did want to talk about something I have been looking into lately for my thesis.

First, I want to post a lazyweb question: has anyone worked with REST::Google? Specifically, does anyone know how to advance the page cursor with this module? I tried reading the code itself, but it uses Class::Accessor and Class::Data. I’m a bit unfamiliar with those modules (are they still popular?), and it looks like the cursor is read-only. I don’t really see the use for the module if it returns 10 results and cannot paginate.

So from that question, I took the sample and tried playing with it. This is just an experiment to exercise the module’s capabilities. The sample is fairly boring, so I want to see if I can get some code working that can paginate through Google results using their REST API (if that’s even possible).

The author of this module probably feels very clever; he hid subpackages from CPAN by putting the package name on a separate line from the package keyword. He also used __PACKAGE__->mk_ro_accessors, which looks like it will generate attribute accessors at runtime. I’m guessing CPAN cannot index those either. What’s the point of uploading your code to a public repository if you take measures to hide it from the repository?

Anyways, I’m soliciting ideas for paginating Google’s REST API results. Note: the SOAP API has been terminated, so that avenue is closed.
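For reference, this is the general shape of the loop I am hoping for: a minimal sketch in Python of offset-based paging, not tied to REST::Google or to any real Google endpoint. The fetch_page callable, the 10-results-per-request page size, and the fake 25-item index are all assumptions standing in for a real search call.

```python
from typing import Callable, Iterator, List


def paginate(fetch_page: Callable[[int], List[str]],
             page_size: int,
             max_results: int) -> Iterator[str]:
    """Yield results page by page until the service runs dry or the cap is hit."""
    start = 0
    while start < max_results:
        page = fetch_page(start)
        if not page:
            break  # the service has nothing more to give
        # Trim the final page so we never yield more than max_results items.
        for item in page[: max_results - start]:
            yield item
        if len(page) < page_size:
            break  # a short page means this was the last one
        start += len(page)


# Hypothetical stand-in for a search service: 25 results, 10 per request.
FAKE_INDEX = [f"result-{i}" for i in range(25)]


def fake_fetch(start: int) -> List[str]:
    return FAKE_INDEX[start:start + 10]


results = list(paginate(fake_fetch, page_size=10, max_results=100))
print(len(results))
```

Whether anything like this is possible depends entirely on the API accepting a start offset or a writable cursor, which is the part I have not been able to confirm.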


Chip’s Core Hacker Presentation

July 06th, 2009 | Category: Meat Space, Permission For Flyby, Project Bootstrap

At YAPC::NA, Chip Salzenberg gave a brief, last-minute talk that really got me thinking. Well, to be honest, the thought process started at the Parrot Workshop but really took off during this talk.

I’ve always been fascinated with programming languages and compilers. These things had always struck me as not academically challenging and basically a solved engineering problem. I never really did much in the area except for an undergraduate course where I wrote a compiler for TinyC, written in Perl, with a MIPS assembler code generator backend. It was a study in recursive descent parsers, so it lacked any real semantic capabilities, such as symbol tables, a garbage collector, or ASTs.

So anyways, back to how this relates to Chip’s laid-back talk. Basically, he encouraged people to become core Perl hackers. “Yes, things are bad, but they’re not that bad.” The talk only lasted 20 minutes, but my brain kept spinning for the rest of the day. I tried having a few conversations, but I failed to sustain anything for more than a few minutes. The talk wasn’t overly special. It wasn’t groundbreaking or funny; it was clearly an impromptu presentation with slides obviously created quickly; it lacked substantial content and it was not thought-provoking. It was the perfect talk at the perfect time because my brain was in the perfect state of mind to think clearly and creatively.

So I have since turned my attention away from my thesis for the summer and towards compilers and programming languages. I have started filling my research notebook with all sorts of ideas on the chance that I spark a true flash of genius. My advisor is likely to be annoyed that I decided to change directions radically, but he went overseas for the summer and has stopped responding to email.

Thank you, Chip. You may very well have ignited my imagination and passion to create something truly worthwhile.


Case Of Neglect

April 26th, 2009 | Category: 01100011, Zero-blog

It’s pretty obvious that I have been neglecting this space lately. It’s been well over 2 months since I last posted anything, and that was one of those annoying “updating things” updates.

Well, I’ve decided to partake in the Iron Man perl blogging challenge. Once a week, every week, I will post something, anything, about perl. This should be easy because it’s basically all I’ve been programming in lately and it’s easily my favorite language. My official start post is right now.

I’m working on an interesting new library that I’d like to think can change things in a big way for the open source community. I’m working on this with a fellow perl6 monger. Oh, you didn’t catch that? Yeah, I helped start the first Perl 6 Mongers group, right here in Dallas. DFW.pm is basically defunct; one of the problems that plagued it was the venue. Rather than trying to cater to everyone across this horribly sprawled metroplex, I think it would be best to just focus on a few key areas. We may rename it later, but Dallas.p6m seems fitting.

Speaking of which, I will be attending YAPC::NA 2009. My wife and I will treat this as our mini-vacation, since we haven’t taken one in 2 years! I contemplated giving a talk, but I decided I’m too new to this whole presentation thing and I’m not sure I really have anything to say. Hopefully next year I will have something interesting to give a talk about.

The semester is almost over and I’m still swamped to the gills. I’ve been neglecting everything lately due to the overwhelming amount of work I’ve been assigned. I will be really happy to make it through to the summer for some time off. I’ll still be working on my thesis, but that’s not nearly as pressing as normal class deadlines (unless your advisor makes you redo stuff right before you defend).
