Thoughts on Lucene, Solr, crawling and vertical search 

Posts about PHP

Download KhanAcademy videos with a PHP crawler

Posted by Kelvin on 08 Oct 2011 | Tagged as: PHP, programming

At the moment (October 2011), there's no simple way to download all videos from a playlist from KhanAcademy.org.

This simple PHP crawler script changes that. :-)

It downloads the videos (from archive.org) to a subfolder, numbering each one and naming it with its real title (not the gibberish filename that archive.org has assigned it). Because it calls wget --continue, the crawler has auto-resume support: even if your computer crashes in the middle of a crawl, you don't need to start all over again.
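The numbering, naming and resume behaviour all come down to how the wget command line is put together. Here is a minimal sketch of that one step; the helper name build_wget_command, the slash replacement and the sample URL are my own illustrative assumptions and are not taken from the script further down.

```php
<?php
// Hypothetical helper: build the wget command for a single video.
function build_wget_command($counter, $title, $videoUrl) {
    // Assumption: swap slashes for dashes so the title is a valid file name.
    $safeTitle = str_replace(array('/', '\\'), '-', $title);
    $outfile = "$counter. $safeTitle.mp4";
    // --continue makes wget resume a partial file instead of starting over.
    // escapeshellarg() quotes each argument so spaces in titles are safe.
    return 'wget --continue ' . escapeshellarg($videoUrl)
         . ' -O ' . escapeshellarg($outfile);
}

// Made-up archive.org URL, for illustration only.
$cmd = build_wget_command(1, 'Scale of Earth and Sun',
    'http://www.archive.org/download/example/example.mp4');
echo $cmd, "\n";
```

Running the same command a second time is what gives the resume behaviour: wget picks up where the existing output file left off.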

Usage

Usage is like this, assuming the script is named downkhan.php:

php downkhan.php {folder} {urls.txt}
php downkhan.php history history.txt
 

where folder is the subdirectory to save the videos in, and urls.txt is a list of URLs and titles (one url|title pair per line) obtained by running a regex on http://www.khanacademy.org/#browse.

Regex

The regex used was

href="(.*?)".*?><span.*?>(.*?)</span>
 

urls

Here are a few lines from a urls.txt file:

http://www.khanacademy.org/video/scale-of-earth-and--sun?playlist=Cosmology+and+Astronomy|Scale of Earth and  Sun
http://www.khanacademy.org/video/scale-of-solar-system?playlist=Cosmology+and+Astronomy|Scale of Solar System
http://www.khanacademy.org/video/scale-of-distance-to-closest-stars?playlist=Cosmology+and+Astronomy|Scale of Distance to Closest Stars
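Lines in this url|title format can be produced by applying the regex above to the HTML of the browse page. Here is a rough sketch of one way to do it. The function name build_url_lines, the host prefixing for relative links and the sample markup are assumptions made for illustration; the real page markup may differ.

```php
<?php
// Hypothetical helper: turn page HTML into "url|title" lines.
function build_url_lines($html, $base = 'http://www.khanacademy.org') {
    $lines = array();
    // The regex from the post, wrapped in # delimiters.
    $pattern = '#href="(.*?)".*?><span.*?>(.*?)</span>#';
    preg_match_all($pattern, $html, $matches, PREG_SET_ORDER);
    foreach ($matches as $m) {
        $url = $m[1];
        // Assumption: links may be site-relative, so prefix the host.
        if (strpos($url, 'http') !== 0) {
            $url = $base . $url;
        }
        $title = trim(strip_tags($m[2]));
        $lines[] = $url . '|' . $title;
    }
    return $lines;
}

// Made-up sample markup, only to show the shape of the output.
$sample = '<li><a href="/video/scale-of-solar-system?playlist=Cosmology+and+Astronomy">'
        . '<span class="t">Scale of Solar System</span></a></li>';
$lines = build_url_lines($sample);
echo implode("\n", $lines), "\n";
```

The output can be redirected into a file such as cosmology.txt and passed to the crawler as its second argument.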
 

Here's a list of what I've created so far:

http://www.supermind.org/code/history.txt
http://www.supermind.org/code/biology.txt
http://www.supermind.org/code/finance.txt
http://www.supermind.org/code/cosmology.txt
http://www.supermind.org/code/healthcare.txt
http://www.supermind.org/code/linearalgebra.txt
http://www.supermind.org/code/statistics.txt

script code

And here's the script:

<?php
$args = $_SERVER['argv'];
$folder = $args[1];
$file = $args[2];

$arr = explode("\n", trim(file_get_contents(getcwd()."/".$file)));
$urls = array();
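foreach($arr as $k) {
  // each line is "url|title"; skip blank or malformed lines
  $split = explode("|", trim($k), 2);
  if (count($split) < 2) continue;
  $urls[$split[0]] = $split[1];
}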

if (!is_dir($folder)) mkdir($folder); // the folder already exists when resuming a crawl
chdir($folder);
$counter = 0;

foreach($urls as $url=>$title) {
  $counter++;

  echo "Fetching $url\n";
  $html = '';
  while(!$html) $html = fetch_url($url);
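  $vid = get_match("/<a href=\"(http:\/\/www\.archive\.org.*?)\"/", $html);
  if (!$vid) {
    echo "No archive.org link found for $url, skipping\n";
    continue;
  }
  $outfile = "$counter. $title.mp4";

  // quote both arguments so spaces and shell metacharacters are safe
  $cmd = "wget --continue " . escapeshellarg($vid) . " -O " . escapeshellarg($outfile);
  shell_exec($cmd);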
}

function get_match($pattern, $s) {
  preg_match($pattern, $s, $matches);
  if($matches) {
    return $matches[1];
  } else return NULL;
}

function fetch_url($url)
{
    $curl_handle = curl_init(); // initialize curl handle
    curl_setopt($curl_handle, CURLOPT_URL, $url); // set url to fetch
    curl_setopt($curl_handle, CURLOPT_FAILONERROR, 1);
    curl_setopt($curl_handle, CURLOPT_CONNECTTIMEOUT, 5);
    curl_setopt($curl_handle, CURLOPT_TIMEOUT, 20); // overall timeout in seconds
    curl_setopt($curl_handle, CURLOPT_FOLLOWLOCATION, 1); // allow redirects
    curl_setopt($curl_handle, CURLOPT_RETURNTRANSFER, 1); // return into a variable
    curl_setopt($curl_handle, CURLOPT_HTTPHEADER, array('Accept: */*', 'User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows)'));
    $result = curl_exec($curl_handle); // run the request once
    if ($result === false) {
        echo 'Curl error: ' . curl_error($curl_handle) . "\n";
    }
    curl_close($curl_handle);
    return $result;
}

function rel2abs($rel, $base)
{
    /* return if already absolute URL */
    if (parse_url($rel, PHP_URL_SCHEME) != '') return $rel;

    /* queries and anchors */
    if ($rel[0] == '#' || $rel[0] == '?') return $base . $rel;

    /* parse base URL and convert to local variables:
 $scheme, $host, $path */

    extract(parse_url($base));

    /* remove non-directory element from path */
    $path = preg_replace('#/[^/]*$#', '', $path);

    /* destroy path if relative url points to root */
    if ($rel[0] == '/') $path = '';

    /* dirty absolute URL */
    $abs = "$host$path/$rel";

    /* replace '//' or '/./' or '/foo/../' with '/' */
    $re = array('#(/\.?/)#', '#/(?!\.\.)[^/]+/\.\./#');
    for ($n = 1; $n > 0; $abs = preg_replace($re, '/', $abs, -1, $n)) {
    }

    /* absolute URL is ready! */
    return $scheme . '://' . $abs;
}
 

Painless CRUD in PHP via AjaxCrud

Posted by Kelvin on 08 Oct 2011 | Tagged as: PHP, programming

I recently discovered an Ajax CRUD library which makes CRUD operations positively painless: AjaxCRUD

Its features include:

- displays records in an inline-editable table
- generates a create form
- handles all operations (add, edit, delete) via Ajax
- supports 1:many relations
- only 1 class to include!!

I highly recommend you try it out!

Here is the example code:

# the code for the class
include ('ajaxCRUD.class.php');

# this one line of code is how you implement the class
$tblCustomer = new ajaxCRUD("Customer",
                             "tblCustomer", "pkCustomerID");

# don't show the primary key in the table
$tblCustomer->omitPrimaryKey();

# my db fields all have prefixes;
# display headers as reasonable titles
$tblCustomer->displayAs("fldFName", "First");
$tblCustomer->displayAs("fldLName", "Last");
$tblCustomer->displayAs("fldPaysBy", "Pays By");
$tblCustomer->displayAs("fldDescription", "Customer Info");

# set the height for my textarea
$tblCustomer->setTextareaHeight('fldDescription', 100);

# define allowable fields for my dropdown fields
# (this can also be done for a pk/fk relationship)
$values = array("Cash", "Credit Card", "Paypal");
$tblCustomer->defineAllowableValues("fldPaysBy", $values);

# add the filter box (above the table)
$tblCustomer->addAjaxFilterBox("fldFName");

# actually show the table
$tblCustomer->showTable();
 

PHP function to send an email with file attachment

Posted by Kelvin on 11 Jun 2011 | Tagged as: PHP, programming

Courtesy of http://www.finalwebsites.com/forums/topic/php-e-mail-attachment-script

function mail_attachment($filename, $path, $mailto, $from_mail, $from_name, $replyto, $subject, $message) {
    $file = $path.$filename;
    // read the attachment and base64-encode it in 76-character lines
    $content = chunk_split(base64_encode(file_get_contents($file)));
    $uid = md5(uniqid(time())); // used as the MIME boundary
    // headers only; the MIME parts belong in the message body
    $header = "From: ".$from_name." <".$from_mail.">\r\n";
    $header .= "Reply-To: ".$replyto."\r\n";
    $header .= "MIME-Version: 1.0\r\n";
    $header .= "Content-Type: multipart/mixed; boundary=\"".$uid."\"";
    $body = "This is a multi-part message in MIME format.\r\n";
    $body .= "--".$uid."\r\n";
    $body .= "Content-Type: text/plain; charset=iso-8859-1\r\n";
    $body .= "Content-Transfer-Encoding: 7bit\r\n\r\n";
    $body .= $message."\r\n\r\n";
    $body .= "--".$uid."\r\n";
    $body .= "Content-Type: application/octet-stream; name=\"".$filename."\"\r\n"; // use different content types here
    $body .= "Content-Transfer-Encoding: base64\r\n";
    $body .= "Content-Disposition: attachment; filename=\"".$filename."\"\r\n\r\n";
    $body .= $content."\r\n\r\n";
    $body .= "--".$uid."--";
    if (mail($mailto, $subject, $body, $header)) {
        echo "mail send ... OK"; // or use booleans here
    } else {
        echo "mail send ... ERROR!";
    }
}
 

Prettyprint xml in PHP

Posted by Kelvin on 04 Dec 2010 | Tagged as: PHP

Ever wanted to format your XML nicely? Use the SimpleDOM class.

Usage is like so:

include "SimpleDOM.php";

$xml = "<foo><bar>car</bar></foo>";
$dom = simpledom_load_string($xml);
$xml = $dom->asPrettyXML();
echo $xml;
 

Produces:

<?xml version="1.0"?>
<foo>
  <bar>car</bar>
</foo>
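If you would rather not pull in an extra class, PHP's bundled DOM extension can do something similar. This is a different technique from SimpleDOM's asPrettyXML(), shown only as a sketch; it assumes the DOM extension is enabled, which it usually is.

```php
<?php
$xml = "<foo><bar>car</bar></foo>";

$dom = new DOMDocument();
// Drop existing whitespace-only nodes so the formatter can re-indent cleanly.
$dom->preserveWhiteSpace = false;
// Ask saveXML() to indent nested elements.
$dom->formatOutput = true;
$dom->loadXML($xml);

$pretty = $dom->saveXML();
echo $pretty;
```

Both properties need to be set before loadXML() is called for the re-indenting to take effect on input that already contains whitespace.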
 

URLizer: a WordPress plugin to automatically linkify URLs

Posted by Kelvin on 12 Oct 2010 | Tagged as: PHP, programming

Am I the only guy using WordPress who is too lazy to type out anchors?

Well, I've been using a WordPress plugin I wrote to automagically linkify URLs for a number of years now, and finally decided to add it to Google Code.

So here it is! http://code.google.com/p/urlizer/

Run php from html files on Dreamhost

Posted by Kelvin on 10 Oct 2010 | Tagged as: PHP, programming

Modify .htaccess to include this:

Correct

AddType php-cgi .html .htm
 

WRONG

AddType application/x-httpd-php .php .htm .html
 

or

AddHandler application/x-httpd-php .html
 

[SOLVED] Howto build the PHP rrdtool extension

Posted by Kelvin on 09 Oct 2010 | Tagged as: PHP, programming, Ubuntu

The definitive answer is here: http://www.samtseng.liho.tw/~samtz/blog/2009/03/11/howto-build-the-php-rrdtool-extension/

If you're on Ubuntu, do this first:

sudo apt-get install rrdtool librrd-dev php5-dev
 

Then follow the steps in the linked post.

[SOLVED] curl: (56) Received problem 2 in the chunky parser

Posted by Kelvin on 09 Oct 2010 | Tagged as: crawling, PHP, programming

The problem is described here:

http://curl.haxx.se/mail/lib-2006-04/0046.html

I successfully tracked the problem to the "Connection:" header. It seems that
if the "Connection: keep-alive" request header is not sent the server will
respond with data which is not chunked . It will still reply with a
"Transfer-Encoding: chunked" response header though.
I don't think this behavior is normal and it is not a cURL problem. I'll
consider the case closed but if somebody wants to make something about it I
can send additional info and test it further.

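The workaround is simple: have curl use HTTP version 1.0 instead of 1.1. Chunked transfer encoding only exists in HTTP 1.1, so a server answering a 1.0 request should not claim to be sending chunked data.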

In PHP, add this:

curl_setopt($curl_handle, CURLOPT_HTTP_VERSION, CURL_HTTP_VERSION_1_0 );
 

A kick-ass PHP mysql escaping function

Posted by Kelvin on 31 Jul 2010 | Tagged as: PHP, programming

Hate calling mysql_real_escape_string repeatedly in your code? Use these functions cobbled together from http://www.php.net/manual/en/function.mysql-real-escape-string.php

/**
* USAGE: mysql_safe( string $query [, array $params ] )
* $query - SQL query WITHOUT any user-entered parameters. Replace parameters with "?"
*     e.g. $query = "SELECT date from history WHERE login = ?"
* $params - array of parameters
*
* Example:
*    mysql_safe( "SELECT secret FROM db WHERE login = ?", array($login) );    # one parameter
*    mysql_safe( "SELECT secret FROM db WHERE login = ? AND password = ?", array($login, $password) );    # multiple parameters
* This sends MySQL a safe query with $login and $password escaped.
**/

function mysql_safe($query,$params=false) {
    if ($params) {
        foreach ($params as &$v) { $v = db_escape($v); }    # Escaping parameters
        unset($v);    # break the reference left behind by the loop
        # str_replace - doubling any literal % so vsprintf leaves it alone, then replacing ? -> %s. %s is ugly in raw sql query
        # vsprintf - replacing all %s with the parameters
        $sql_query = vsprintf( str_replace(array("%","?"), array("%%","%s"), $query), $params );
        $sql_query = mysql_query($sql_query);    # Performing escaped query
    } else {
        $sql_query = mysql_query($query);    # If no params...
    }

    return ($sql_query);
}

/**
 * Automatically adds quotes (unless $quotes is false), but only for strings. Null values are converted to mysql keyword "null", booleans are converted to 1 or 0, and numbers are left alone.
 * Also can escape a single variable or recursively escape an array of unlimited depth.
 */

function db_escape($values, $quotes = true) {
    if (is_array($values)) {
        foreach ($values as $key => $value) {
            $values[$key] = db_escape($value, $quotes);
        }
    }
    else if ($values === null) {
        $values = 'NULL';
    }
    else if (is_bool($values)) {
        $values = $values ? 1 : 0;
    }
    else if (!is_numeric($values)) {
        $values = mysql_real_escape_string($values);
        if ($quotes) {
            $values = '"' . $values . '"';
        }
    }
    return $values;
}
 

Usage

It works as a drop-in replacement for mysql_query when no placeholders (?) are used.

$result = mysql_safe("select 1 from table");
 

Use placeholders like so.

$result = mysql_safe("select ? from table where foo=?", array(1, "bar"));
 

The original mysql_safe function didn't escape numerics properly. The db_escape function does that nicely.
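To see what string actually ends up being sent to MySQL, the placeholder substitution can be reproduced without a database connection. The sketch below is for illustration only: demo_escape and demo_build_query are made-up names, and addslashes() is just a stand-in for mysql_real_escape_string() so the code runs anywhere. It is not a safe replacement for real escaping. The doubling of literal % signs is an extra precaution in this sketch.

```php
<?php
// Stand-in for db_escape() that needs no database connection.
// addslashes() is NOT a safe substitute for mysql_real_escape_string().
function demo_escape($value) {
    if ($value === null) return 'NULL';
    if (is_bool($value)) return $value ? 1 : 0;
    if (is_numeric($value)) return $value;
    return '"' . addslashes($value) . '"';
}

// Hypothetical helper: returns the SQL string instead of running it.
function demo_build_query($query, $params) {
    $escaped = array_map('demo_escape', $params);
    // Double literal % first so vsprintf() does not read them as format specs,
    // then turn each ? placeholder into %s.
    $format = str_replace(array('%', '?'), array('%%', '%s'), $query);
    return vsprintf($format, $escaped);
}

$sql = demo_build_query('select ? from table where foo=?', array(1, 'bar'));
echo $sql, "\n";
```

Numbers pass through unquoted, while strings are escaped and wrapped in double quotes, which matches what db_escape does with its default $quotes = true.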

TokyoCabinet PHP Extension

Posted by Kelvin on 29 Jun 2010 | Tagged as: PHP, programming

I guess no one really interfaces directly with TokyoCabinet from PHP. For most cases, TokyoTyrant is probably more appropriate.

If you do need to though, check out http://code.google.com/p/1bacode/source/browse/trunk/front-end/extension/?r=12#extension/tokyocabinet.

Works great, and was surprisingly hard to find.

See my other post for help compiling the PHP extension.


02/23/2012 | Kelvin Tan | Lucene Solr Crawl Consultant