Sunday, July 18, 2010: phpQuery makes scraping a hell of a lot easier. « from the old blog archive »
phpQuery is a PHP port of jQuery, so if you know how to use jQuery and PHP, you can easily use phpQuery. It's really awesome.
You just need to see it: http://code.google.com/p/phpquery/
I am using phpQuery in the backend of my DJMAX TECHNIKA Score Tracker. The backend fetches the scores from Platinum Crew's website and puts them in a MySQL table every 4 hours, and the frontend then reads from that table. Where it can, it also uses the DJMAX Technika API (Thailand only ;)) to fetch data instead of scraping, in order to save bandwidth.
I made a simple utility function that makes scraping tabular data from web pages a lot easier.
All you need to do is throw in a selector for each column on that page. Just make sure that the selector for each column yields the same number of matching elements. The function then passes each matched element to a parser function, which you define, to turn that element into a value.
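To make the idea concrete, here is a minimal sketch of what such a helper could look like. It is not my original function: the name scrape_columns, the parser signature, the sample markup, and the selectors are all made up for illustration, and the require path assumes phpQuery is unpacked next to the script.

```php
<?php
// Assumption: phpQuery lives in ./phpQuery/; adjust the path to your install.
require_once 'phpQuery/phpQuery.php';

/**
 * Hypothetical helper (illustration only).
 * $selectors maps a column name to a CSS selector. $parser is any PHP
 * callback; it receives the matched element wrapped in a phpQuery object
 * plus the column name, and returns the value for that cell.
 */
function scrape_columns($html, array $selectors, $parser)
{
    $doc = phpQuery::newDocumentHTML($html);
    $columns = array();
    $rowCount = null;

    foreach ($selectors as $name => $selector) {
        $values = array();
        // Iterating a phpQuery result yields plain DOM nodes, so re-wrap with pq().
        foreach ($doc->find($selector) as $node) {
            $values[] = call_user_func($parser, pq($node), $name);
        }
        // Every selector must match the same number of elements.
        if ($rowCount === null) {
            $rowCount = count($values);
        } elseif (count($values) !== $rowCount) {
            throw new Exception(
                "Selector '$selector' matched " . count($values)
                . " elements, expected $rowCount"
            );
        }
        $columns[$name] = $values;
    }

    // Turn the per-column lists into one associative array per row.
    $rows = array();
    for ($i = 0; $i < (int) $rowCount; $i++) {
        $row = array();
        foreach ($columns as $name => $values) {
            $row[$name] = $values[$i];
        }
        $rows[] = $row;
    }
    return $rows;
}

// Made-up markup and selectors, standing in for a real page.
$html = '<table>'
      . '<tr><td class="rank">1</td><td class="dj">Alice</td></tr>'
      . '<tr><td class="rank">2</td><td class="dj">Bob</td></tr>'
      . '</table>';

// Example parser: ranks become integers, everything else is trimmed text.
function parse_cell($element, $column)
{
    $text = trim($element->text());
    return $column === 'rank' ? (int) $text : $text;
}

$rows = scrape_columns(
    $html,
    array('rank' => 'td.rank', 'dj' => 'td.dj'),
    'parse_cell'
);
print_r($rows);
```

Throwing an exception on a count mismatch is a design choice in this sketch: if one selector silently matches an extra element, every row after it would be shifted, so failing loudly seems safer than storing misaligned data.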
You can find the selector for the column you want easily using SelectorGadget.
For example, the following code scans the DJ Title Rank page from Platinum Crew (Thailand)'s website for the top 15 DJs.
Note that the callback function that I use is a class member of the PCParser class. That's why you see the callback function in arrays. Now here's the output.
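As an aside on the array-style callbacks mentioned in the note above: in PHP, a class method is passed as a callback by wrapping the class name (or an object) and the method name in an array. The sketch below only illustrates that syntax. The PCParser body and its method names are invented here and are not the real class.

```php
<?php
// Hypothetical stand-in for PCParser; method names and bodies are made up.
class PCParser
{
    // Static method: referenced as array('PCParser', 'parseRank').
    public static function parseRank($text)
    {
        return (int) trim($text);
    }

    // Instance method: referenced as array($object, 'parseName').
    public function parseName($text)
    {
        return trim($text);
    }
}

$parser = new PCParser();

// call_user_func() accepts both array forms as callbacks.
$rank = call_user_func(array('PCParser', 'parseRank'), ' 7 ');
$name = call_user_func(array($parser, 'parseName'), '  DJ Example ');

var_dump($rank, $name);
```

Any function that accepts a PHP callback can take either array form, so a parser method can live on a class without needing a global wrapper function.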
It works like a charm. Whenever the website's design is updated or changed, I just update these selectors (using SelectorGadget gives me the selector in less than one minute) and everything works again.