Download KhanAcademy videos with a PHP crawler
Posted by Kelvin on 08 Oct 2011 | Tagged as: PHP, programming
At the moment (October 2011), there's no simple way to download all videos from a playlist from KhanAcademy.org.
This simple PHP crawler script changes that.
It downloads the videos (from archive.org) to a subfolder, numbering them and naming them with their respective titles (not the gibberish titles that archive.org has assigned them). Because it calls wget --continue, the crawler has auto-resume support, so even if your computer crashes in the middle of a crawl, you don't need to start all over again.
Usage
Assuming the script is named downkhan.php, run it like this:
php downkhan.php history history.txt
where the first argument (history) is the subdirectory to save the videos in, and the second (history.txt) is a list of URLs obtained by running a regex on http://www.khanacademy.org/#browse.
Regex
The regex used was
urls
Here are a few lines from one of these URL list files (one url|title pair per line):
http://www.khanacademy.org/video/scale-of-solar-system?playlist=Cosmology+and+Astronomy|Scale of Solar System
http://www.khanacademy.org/video/scale-of-distance-to-closest-stars?playlist=Cosmology+and+Astronomy|Scale of Distance to Closest Stars
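As a rough sketch of how lines in this url|title format could be generated from a page's HTML: the anchor markup assumed below, the pattern, and the function name extract_video_lines are all hypothetical illustrations, not the regex actually used for these lists, and the real KhanAcademy markup may differ.

```php
<?php
// Assumed markup (hypothetical): <a href="/video/slug?playlist=Name">Title</a>
// Returns one "absolute-url|title" string per matching anchor.
function extract_video_lines($html, $base = 'http://www.khanacademy.org') {
    $lines = array();
    $pattern = '#<a href="(/video/[^"]+)"[^>]*>([^<]+)</a>#';
    if (preg_match_all($pattern, $html, $matches, PREG_SET_ORDER)) {
        foreach ($matches as $m) {
            $url = $base . html_entity_decode($m[1]);
            $title = trim(html_entity_decode($m[2]));
            $lines[] = $url . '|' . $title;
        }
    }
    return $lines;
}

// Sample input built to match the assumed markup.
$sample = '<li><a href="/video/scale-of-solar-system?playlist=Cosmology+and+Astronomy">Scale of Solar System</a></li>';
echo implode("\n", extract_video_lines($sample)), "\n";
```

Saving that output to a text file would give something the crawler below can read as its second argument.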
Here's a list of what I've created so far:
http://www.supermind.org/code/history.txt
http://www.supermind.org/code/biology.txt
http://www.supermind.org/code/finance.txt
http://www.supermind.org/code/cosmology.txt
http://www.supermind.org/code/healthcare.txt
http://www.supermind.org/code/linearalgebra.txt
http://www.supermind.org/code/statistics.txt
script code
And here's the script:
<?php
$args = $_SERVER['argv'];
$folder = $args[1];
$file = $args[2];
$arr = explode("\n", trim(file_get_contents(getcwd()."/".$file)));
$urls = array();
foreach($arr as $k) {
    // each line is "url|title"
    $split = explode("|", trim($k), 2);
    if (count($split) < 2) continue; // skip malformed lines
    $urls[$split[0]] = $split[1];
}
if (!is_dir($folder)) mkdir($folder); // the folder already exists when resuming a crawl
chdir($folder);
$counter = 0;
foreach($urls as $url=>$title) {
    $counter++;
    echo "Fetching $url\n";
    $html = '';
    while(!$html) $html = fetch_url($url); // retry until the page is fetched
    $vid = get_match("/<a href=\"(http:\/\/www\.archive\.org.*?)\"/", $html);
    if (!$vid) {
        echo "No archive.org link found for $url\n";
        continue;
    }
    $outfile = "$counter. $title.mp4";
    // escape both arguments so unusual characters can't break the shell command
    shell_exec("wget --continue " . escapeshellarg($vid) . " -O " . escapeshellarg($outfile));
}
function get_match($pattern, $s) {
    preg_match($pattern, $s, $matches);
    if($matches) {
        return $matches[1];
    } else return NULL;
}
function fetch_url($url)
{
    $curl_handle = curl_init(); // initialize curl handle
    curl_setopt($curl_handle, CURLOPT_URL, $url); // set url to fetch
    curl_setopt($curl_handle, CURLOPT_FAILONERROR, 1);
    curl_setopt($curl_handle, CURLOPT_CONNECTTIMEOUT, 5);
    curl_setopt($curl_handle, CURLOPT_TIMEOUT, 20); // overall timeout in seconds
    curl_setopt($curl_handle, CURLOPT_FOLLOWLOCATION, 1); // allow redirects
    curl_setopt($curl_handle, CURLOPT_RETURNTRANSFER, 1); // return into a variable
    curl_setopt($curl_handle, CURLOPT_HTTPHEADER, array('Accept: */*', 'User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows)'));
    $result = curl_exec($curl_handle); // run the request once
    if ($result === false) {
        echo 'Curl error: ' . curl_error($curl_handle) . "\n";
    }
    curl_close($curl_handle);
    return $result;
}
function rel2abs($rel, $base)
{
    /* return if already absolute URL */
    if (parse_url($rel, PHP_URL_SCHEME) != '') return $rel;
    /* queries and anchors */
    if ($rel[0] == '#' || $rel[0] == '?') return $base . $rel;
    /* parse base URL and convert to local variables:
       $scheme, $host, $path */
    extract(parse_url($base));
    /* remove non-directory element from path */
    $path = preg_replace('#/[^/]*$#', '', $path);
    /* destroy path if relative url points to root */
    if ($rel[0] == '/') $path = '';
    /* dirty absolute URL */
    $abs = "$host$path/$rel";
    /* replace '//' or '/./' or '/foo/../' with '/' */
    $re = array('#(/\.?/)#', '#/(?!\.\.)[^/]+/\.\./#');
    for ($n = 1; $n > 0; $abs = preg_replace($re, '/', $abs, -1, $n)) {
    }
    /* absolute URL is ready! */
    return $scheme . '://' . $abs;
}
Painless CRUD in PHP via AjaxCrud
Posted by Kelvin on 08 Oct 2011 | Tagged as: PHP, programming
I recently discovered an Ajax CRUD library which makes CRUD operations positively painless: AjaxCRUD
Its features include:
- displays the list in an inline-editable table
- generates a create form
- handles all operations (add, edit, delete) via Ajax
- supports 1:many relations
- only 1 class to include!!
I highly recommend you try it out!
Here is the example code:
include ('ajaxCRUD.class.php');
# this one line of code is how you implement the class
$tblCustomer = new ajaxCRUD("Customer",
"tblCustomer", "pkCustomerID");
# don't show the primary key in the table
$tblCustomer->omitPrimaryKey();
# my db fields all have prefixes;
# display headers as reasonable titles
$tblCustomer->displayAs("fldFName", "First");
$tblCustomer->displayAs("fldLName", "Last");
$tblCustomer->displayAs("fldPaysBy", "Pays By");
$tblCustomer->displayAs("fldDescription", "Customer Info");
# set the height for my textarea
$tblCustomer->setTextareaHeight('fldDescription', 100);
# define allowable fields for my dropdown fields
# (this can also be done for a pk/fk relationship)
$values = array("Cash", "Credit Card", "Paypal");
$tblCustomer->defineAllowableValues("fldPaysBy", $values);
# add the filter box (above the table)
$tblCustomer->addAjaxFilterBox("fldFName");
# actually show the table
$tblCustomer->showTable();
PHP function to send an email with file attachment
Posted by Kelvin on 11 Jun 2011 | Tagged as: PHP, programming
Courtesy of http://www.finalwebsites.com/forums/topic/php-e-mail-attachment-script
function mail_attachment($filename, $path, $mailto, $from_mail, $from_name, $replyto, $subject, $message) {
    $file = $path.$filename;
    $file_size = filesize($file);
    $handle = fopen($file, "rb"); // binary-safe read
    $content = fread($handle, $file_size);
    fclose($handle);
    $content = chunk_split(base64_encode($content));
    $uid = md5(uniqid(time())); // MIME boundary
    $name = basename($file);
    $header = "From: ".$from_name." <".$from_mail.">\r\n";
    $header .= "Reply-To: ".$replyto."\r\n";
    $header .= "MIME-Version: 1.0\r\n";
    $header .= "Content-Type: multipart/mixed; boundary=\"".$uid."\"\r\n\r\n";
    $header .= "This is a multi-part message in MIME format.\r\n";
    // first part: the plain-text message
    $header .= "--".$uid."\r\n";
    $header .= "Content-type:text/plain; charset=iso-8859-1\r\n";
    $header .= "Content-Transfer-Encoding: 7bit\r\n\r\n";
    $header .= $message."\r\n\r\n";
    // second part: the base64-encoded attachment
    $header .= "--".$uid."\r\n";
    $header .= "Content-Type: application/octet-stream; name=\"".$filename."\"\r\n"; // use different content types here
    $header .= "Content-Transfer-Encoding: base64\r\n";
    $header .= "Content-Disposition: attachment; filename=\"".$filename."\"\r\n\r\n";
    $header .= $content."\r\n\r\n";
    $header .= "--".$uid."--";
    if (mail($mailto, $subject, "", $header)) {
        echo "mail send ... OK"; // or use booleans here
    } else {
        echo "mail send ... ERROR!";
    }
}
Prettyprint xml in PHP
Posted by Kelvin on 04 Dec 2010 | Tagged as: PHP
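A minimal sketch of one way to pretty-print XML with PHP's bundled DOM extension. This is an assumption about a reasonable approach, not necessarily the snippet this post originally carried, and the function name prettyprint_xml is made up for illustration.

```php
<?php
// Hypothetical helper: re-indents an XML string using DOMDocument.
// Returns the formatted XML, or false if the input cannot be parsed.
function prettyprint_xml($xml) {
    $dom = new DOMDocument('1.0');
    $dom->preserveWhiteSpace = false; // drop existing whitespace-only nodes; set before loading
    $dom->formatOutput = true;        // indent nested elements when saving
    if (!@$dom->loadXML($xml)) {
        return false;
    }
    return $dom->saveXML();
}

echo prettyprint_xml('<root><item id="1">a</item><item id="2">b</item></root>');
```

Setting preserveWhiteSpace to false before loading matters: otherwise whitespace text nodes already in the input can stop formatOutput from re-indenting.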
URLizer: a WordPress plugin to automatically linkify URLs
Posted by Kelvin on 12 Oct 2010 | Tagged as: PHP, programming
Am I the only guy using WordPress who is too lazy to type out anchors?
Well, I've been using a WordPress plugin I wrote to automagically linkify URLs for a number of years now, and finally decided to add it to Google Code.
So here it is! http://code.google.com/p/urlizer/
Run php from html files on Dreamhost
Posted by Kelvin on 10 Oct 2010 | Tagged as: PHP, programming
Modify .htaccess to include this:
Correct
WRONG
or
[SOLVED] Howto build the PHP rrdtool extension
Posted by Kelvin on 09 Oct 2010 | Tagged as: PHP, programming, Ubuntu
The definitive answer is here: http://www.samtseng.liho.tw/~samtz/blog/2009/03/11/howto-build-the-php-rrdtool-extension/
If you're on Ubuntu, do this first:
Then follow the steps above.
[SOLVED] curl: (56) Received problem 2 in the chunky parser
Posted by Kelvin on 09 Oct 2010 | Tagged as: crawling, PHP, programming
The problem is described here:
http://curl.haxx.se/mail/lib-2006-04/0046.html
I successfully tracked the problem to the "Connection:" header. It seems that
if the "Connection: keep-alive" request header is not sent the server will
respond with data which is not chunked. It will still reply with a
"Transfer-Encoding: chunked" response header though.
I don't think this behavior is normal and it is not a cURL problem. I'll
consider the case closed but if somebody wants to make something about it I
can send additional info and test it further.
The workaround is simple: have curl use HTTP version 1.0 instead of 1.1.
In PHP, add this:
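A minimal sketch of that setting, assuming the PHP curl extension is loaded. CURLOPT_HTTP_VERSION and CURL_HTTP_VERSION_1_0 are the standard curl constants; the helper name http10_handle and the example URL are made up for illustration.

```php
<?php
// Hypothetical helper: returns a curl handle that speaks HTTP/1.0.
// HTTP/1.0 has no chunked transfer encoding, which sidesteps the chunk parser.
function http10_handle($url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_HTTP_VERSION, CURL_HTTP_VERSION_1_0);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); // return the body instead of printing it
    return $ch;
}

$ch = http10_handle('http://example.com/');
// $body = curl_exec($ch); // uncomment to actually perform the request
```

If you already have a handle, the single curl_setopt call with CURLOPT_HTTP_VERSION is the only line that needs adding.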
A kick-ass PHP mysql escaping function
Posted by Kelvin on 31 Jul 2010 | Tagged as: PHP, programming
Hate calling mysql_real_escape_string repeatedly in your code? Use these functions cobbled together from http://www.php.net/manual/en/function.mysql-real-escape-string.php
/**
 * USAGE: mysql_safe( string $query [, array $params ] )
 * $query - SQL query WITHOUT any user-entered parameters. Replace parameters with "?"
 *   e.g. $query = "SELECT date from history WHERE login = ?"
 * $params - array of parameters
 *
 * Example:
 * mysql_safe( "SELECT secret FROM db WHERE login = ?", array($login) ); # one parameter
 * mysql_safe( "SELECT secret FROM db WHERE login = ? AND password = ?", array($login, $password) ); # multiple parameters
 * This results in a safe MySQL query with escaped $login and $password.
 **/
function mysql_safe($query,$params=false) {
    if ($params) {
        foreach ($params as &$v) { $v = db_escape($v); } # Escaping parameters
        unset($v); # break the reference left over from the loop
        # str_replace - replacing ? -> %s. %s is ugly in raw sql query
        # vsprintf - replacing all %s to parameters
        $sql_query = vsprintf( str_replace("?","%s",$query), $params );
        $sql_query = mysql_query($sql_query); # Performing escaped query
    } else {
        $sql_query = mysql_query($query); # If no params...
    }
    return ($sql_query);
}
/**
 * Automatically adds quotes (unless $quotes is false), but only for strings. Null values are converted to mysql keyword "null", booleans are converted to 1 or 0, and numbers are left alone.
 * Also can escape a single variable or recursively escape an array of unlimited depth.
 */
function db_escape($values, $quotes = true) {
    if (is_array($values)) {
        foreach ($values as $key => $value) {
            $values[$key] = db_escape($value, $quotes);
        }
    }
    else if ($values === null) {
        $values = 'NULL';
    }
    else if (is_bool($values)) {
        $values = $values ? 1 : 0;
    }
    else if (!is_numeric($values)) {
        $values = mysql_real_escape_string($values);
        if ($quotes) {
            $values = '"' . $values . '"';
        }
    }
    return $values;
}
Usage
As a drop-in replacement for mysql_query when no placeholders (?) are used.
Use placeholders like so.
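A minimal sketch of both usages. Since the mysql_* functions need a live connection (and were removed in PHP 7), it only shows the placeholder expansion step and prints the resulting SQL instead of running it. The names demo_escape and demo_build_query are hypothetical stand-ins for db_escape and mysql_safe, and addslashes is used purely for illustration; it is not a safe replacement for mysql_real_escape_string.

```php
<?php
// Hypothetical stand-in for db_escape(): quotes strings, leaves numbers alone.
// addslashes() is for illustration only and is NOT safe for real queries.
function demo_escape($value) {
    if ($value === null) return 'NULL';
    if (is_bool($value)) return $value ? 1 : 0;
    if (is_numeric($value)) return $value;
    return '"' . addslashes($value) . '"';
}

// Hypothetical stand-in for mysql_safe(): same "?" -> "%s" -> vsprintf expansion,
// but it returns the SQL string instead of calling mysql_query().
function demo_build_query($query, array $params = array()) {
    if (!$params) {
        return $query; // no placeholders: the query passes through unchanged
    }
    $escaped = array_map('demo_escape', $params);
    return vsprintf(str_replace('?', '%s', $query), $escaped);
}

// Drop-in style call, no placeholders.
echo demo_build_query('SELECT date FROM history'), "\n";

// Placeholder style call, one "?" per parameter.
echo demo_build_query(
    'SELECT secret FROM db WHERE login = ? AND id = ?',
    array("o'brien", 42)
), "\n";
```

With the real functions the calls have the same shape: mysql_safe($query) or mysql_safe($query, array(...)), each returning whatever mysql_query returns.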
The original mysql_safe function didn't escape numerics properly. The db_escape function does that nicely.
TokyoCabinet PHP Extension
Posted by Kelvin on 29 Jun 2010 | Tagged as: PHP, programming
I guess no one really interfaces directly with TokyoCabinet from PHP. For most cases, TokyoTyrant is probably more appropriate.
If you do need to though, check out http://code.google.com/p/1bacode/source/browse/trunk/front-end/extension/?r=12#extension/tokyocabinet.
Works great, and was surprisingly hard to find.
