Supermind Search Consulting Blog 
Solr - ElasticSearch - Big Data

Posts about programming

Download KhanAcademy videos with a PHP crawler

Posted by Kelvin on 08 Oct 2011 | Tagged as: PHP, programming

At the moment (October 2011), there's no simple way to download all videos from a playlist from KhanAcademy.org.

This simple PHP crawler script changes that. :-)

It downloads the videos (from archive.org) to a subfolder, numbering them and naming them after their actual titles (not the gibberish names that archive.org has assigned them). Additionally, through the use of wget --continue, the crawler has auto-resume support, so even if your computer crashes in the middle of a crawl, you don't need to start all over again.

Usage

Usage is like this, assuming the script is named downkhan.php:

php downkhan.php {folder} {urls.txt}
php downkhan.php history history.txt
 

where folder is the subdirectory to save the videos in, and urls.txt is a list of urls obtained by running a regex on http://www.khanacademy.org/#browse.

Regex

The regex used was

href="(.*?)".*?><span.*?>(.*?)</span>
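
As a rough illustration of how that regex could be turned into urls.txt lines, here is a minimal Python 3 sketch. The sample markup, the base URL used for joining, and the helper name to_url_lines are assumptions made for the example, not the page's actual HTML or the method originally used.

```python
import re
from urllib.parse import urljoin

# The regex quoted in the post, unchanged.
LINK_RE = re.compile(r'href="(.*?)".*?><span.*?>(.*?)</span>')

def to_url_lines(html, base="http://www.khanacademy.org/"):
    """Return 'url|title' lines, the format downkhan.php expects in urls.txt."""
    lines = []
    for href, title in LINK_RE.findall(html):
        # Relative links are resolved against the (assumed) site root.
        lines.append(urljoin(base, href) + "|" + title.strip())
    return lines

# Hypothetical snippet shaped like a playlist entry; the real markup may differ.
sample = (
    '<li><a href="/video/scale-of-solar-system?playlist=Cosmology+and+Astronomy">'
    '<span class="title">Scale of Solar System</span></a></li>'
)

for line in to_url_lines(sample):
    print(line)
```

Redirecting the printed lines into a file gives something usable as the second argument to downkhan.php.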
 

urls

Here are a few lines of a urls.txt file:

http://www.khanacademy.org/video/scale-of-earth-and--sun?playlist=Cosmology+and+Astronomy|Scale of Earth and  Sun
http://www.khanacademy.org/video/scale-of-solar-system?playlist=Cosmology+and+Astronomy|Scale of Solar System
http://www.khanacademy.org/video/scale-of-distance-to-closest-stars?playlist=Cosmology+and+Astronomy|Scale of Distance to Closest Stars
 

Here's a list of what I've created so far:

http://www.supermind.org/code/history.txt
http://www.supermind.org/code/biology.txt
http://www.supermind.org/code/finance.txt
http://www.supermind.org/code/cosmology.txt
http://www.supermind.org/code/healthcare.txt
http://www.supermind.org/code/linearalgebra.txt
http://www.supermind.org/code/statistics.txt

script code

And here's the script:

<?php
$args = $_SERVER['argv'];
$folder = $args[1];
$file = $args[2];

$arr = explode("\n", trim(file_get_contents(getcwd()."/".$file)));
$urls = array();
foreach($arr as $k) {
  $split = explode("|", $k);
  $urls[$split[0]] = $split[1];
}

if (!is_dir($folder)) mkdir($folder); // don't fail when resuming into an existing folder
chdir($folder);
$counter = 0;

foreach($urls as $url=>$title) {
  $counter++;

  echo "Fetching $url\n";
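  $html = '';
  while(!$html) $html = fetch_url($url); // retry until the page is fetched
  $vid = get_match("/<a href=\"(http:\/\/www.archive.org.*?)\"/", $html);
  if (!$vid) {
    echo "No archive.org link found for $url, skipping\n";
    continue;
  }
  $outfile = "$counter. $title.mp4";

  // --continue lets wget resume a partially downloaded file
  $cmd = 'wget --continue ' . escapeshellarg($vid) . ' -O ' . escapeshellarg($outfile);
  shell_exec($cmd);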
}

function get_match($pattern, $s) {
  preg_match($pattern, $s, $matches);
  if($matches) {
    return $matches[1];
  } else return NULL;
}

function fetch_url($url)
{
    $curl_handle = curl_init(); // initialize curl handle
    curl_setopt($curl_handle, CURLOPT_URL, $url); // set url to post to
    curl_setopt($curl_handle, CURLOPT_FAILONERROR, 1);
    curl_setopt($curl_handle, CURLOPT_CONNECTTIMEOUT, 5);
    curl_setopt($curl_handle, CURLOPT_TIMEOUT, 20); // overall timeout in seconds
    curl_setopt($curl_handle, CURLOPT_FOLLOWLOCATION, 1); // allow redirects
    curl_setopt($curl_handle, CURLOPT_RETURNTRANSFER, 1); // return into a variable
    curl_setopt($curl_handle, CURLOPT_HTTPHEADER, array('Accept: */*', 'User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows)'));
    $result = curl_exec($curl_handle); // run the whole process
    if ($result === false) {
        echo 'Curl error: ' . curl_error($curl_handle) . "\n";
    }
    curl_close($curl_handle);
    return $result;
}

function rel2abs($rel, $base)
{
    /* return if already absolute URL */
    if (parse_url($rel, PHP_URL_SCHEME) != '') return $rel;

    /* queries and anchors */
    if ($rel[0] == '#' || $rel[0] == '?') return $base . $rel;

    /* parse base URL and convert to local variables:
 $scheme, $host, $path */

    extract(parse_url($base));

    /* remove non-directory element from path */
    $path = preg_replace('#/[^/]*$#', '', $path);

    /* destroy path if relative url points to root */
    if ($rel[0] == '/') $path = '';

    /* dirty absolute URL */
    $abs = "$host$path/$rel";

    /* replace '//' or '/./' or '/foo/../' with '/' */
    $re = array('#(/\.?/)#', '#/(?!\.\.)[^/]+/\.\./#');
    for ($n = 1; $n > 0; $abs = preg_replace($re, '/', $abs, -1, $n)) {
    }

    /* absolute URL is ready! */
    return $scheme . '://' . $abs;
}
 

Painless CRUD in PHP via AjaxCrud

Posted by Kelvin on 08 Oct 2011 | Tagged as: PHP, programming

I recently discovered an Ajax CRUD library which makes CRUD operations positively painless: AjaxCRUD

Its features include:

- displays the list in an inline-editable table
- generates a create form
- handles all operations (add, edit, delete) via Ajax
- supports 1:many relations
- only 1 class to include!!

I highly recommend you try it out!

Here is the example code:

# the code for the class
include ('ajaxCRUD.class.php');

# this one line of code is how you implement the class
$tblCustomer = new ajaxCRUD("Customer",
                             "tblCustomer", "pkCustomerID");

# don't show the primary key in the table
$tblCustomer->omitPrimaryKey();

# my db fields all have prefixes;
# display headers as reasonable titles
$tblCustomer->displayAs("fldFName", "First");
$tblCustomer->displayAs("fldLName", "Last");
$tblCustomer->displayAs("fldPaysBy", "Pays By");
$tblCustomer->displayAs("fldDescription", "Customer Info");

# set the height for my textarea
$tblCustomer->setTextareaHeight('fldDescription', 100);

# define allowable fields for my dropdown fields
# (this can also be done for a pk/fk relationship)
$values = array("Cash", "Credit Card", "Paypal");
$tblCustomer->defineAllowableValues("fldPaysBy", $values);

# add the filter box (above the table)
$tblCustomer->addAjaxFilterBox("fldFName");

# actually show the table
$tblCustomer->showTable();
 

HOWTO: Collect WebDriver HTTP Request and Response Headers

Posted by Kelvin on 22 Jun 2011 | Tagged as: crawling, Lucene / Solr / Elastic Search / Nutch, programming

WebDriver is a fantastic Java API for web application testing. It has recently been merged into the Selenium project to provide a friendlier API for programmatic simulation of web browser actions. Its unique property is that it executes web pages in real web browsers such as Firefox, Chrome and IE, and then provides programmatic access to the DOM.

The problem with WebDriver, though, as reported here, is that because the underlying browser implementation does the actual fetching (as opposed to, say, Commons HttpClient), it's currently not possible to obtain the HTTP request and response headers, which is kind of a PITA.

I present here a method of obtaining HTTP request and response headers via an embedded proxy, derived from the Proxoid project.

ProxyLight from Proxoid

ProxyLight is the lightweight standalone proxy from the Proxoid project. It's released under the Apache Public License.

The original code only provided request filtering, and performed no response filtering, forwarding data directly from the web server to the requesting client.

I made some modifications to intercept and parse HTTP response headers.

Get my version here (released under APL): http://downloads.supermind.org/proxylight-20110622.zip

Using ProxyLight from WebDriver

The modified ProxyLight allows you to process both request and response.

This has the added benefit of allowing you to write a RequestFilter which ignores images, or URLs from certain domains. Sweet!

What your WebDriver code has to do then, is:

  1. Ensure the ProxyLight server is started
  2. Add Request and Response Filters to the ProxyLight server
  3. Maintain a cache of requests and responses which you can then retrieve
  4. Ensure the native browser uses our ProxyLight server

Here's a sample class to get you started

package org.supermind.webdriver;

import com.mba.proxylight.ProxyLight;
import com.mba.proxylight.Response;
import com.mba.proxylight.ResponseFilter;
import org.openqa.selenium.firefox.FirefoxDriver;
import org.openqa.selenium.firefox.FirefoxProfile;

import java.util.LinkedHashMap;
import java.util.Map;

public class SampleWebDriver {
  protected int localProxyPort = 5368;
  protected ProxyLight proxy;

  // Bounded response table (the oldest inserted entry is evicted first). Note: this is not thread-safe.
  // Use ConcurrentLinkedHashMap instead: http://code.google.com/p/concurrentlinkedhashmap/
  private LinkedHashMap<String, Response> responseTable = new LinkedHashMap<String, Response>() {
    protected boolean removeEldestEntry(Map.Entry eldest) {
      return size() > 100;
    }
  };

  public Response fetch(String url) {
    if (proxy == null) {
      initProxy();
    }
    FirefoxProfile profile = new FirefoxProfile();

    /**
     * Get the native browser to use our proxy
     */

    profile.setPreference("network.proxy.type", 1);
    profile.setPreference("network.proxy.http", "localhost");
    profile.setPreference("network.proxy.http_port", localProxyPort);

    FirefoxDriver driver = new FirefoxDriver(profile);

    // Now fetch the URL
    driver.get(url);

    Response proxyResponse = responseTable.remove(driver.getCurrentUrl());

    // close the browser so repeated fetches don't leak Firefox instances
    driver.quit();

    return proxyResponse;
  }

  private void initProxy() {
    proxy = new ProxyLight();

    this.proxy.setPort(localProxyPort);

    // this response filter adds the intercepted response to the cache
    this.proxy.getResponseFilters().add(new ResponseFilter() {
      public void filter(Response response) {
        responseTable.put(response.getRequest().getUrl(), response);
      }
    });

    // add request filters here if needed

    // now start the proxy
    try {
      this.proxy.start();
    } catch (Exception e) {
      e.printStackTrace();
    }
  }

  public static void main(String[] args) {
    SampleWebDriver driver = new SampleWebDriver();
    Response res = driver.fetch("http://www.lucenetutorial.com");
    System.out.println(res.getHeaders());
  }
}

 

Solr 3.2 released!

Posted by Kelvin on 22 Jun 2011 | Tagged as: crawling, Lucene / Solr / Elastic Search / Nutch, programming

I'm a little slow off the blocks here, but I just wanted to mention that Solr 3.2 has been released!

Get your download here: http://www.apache.org/dyn/closer.cgi/lucene/solr

Solr 3.2 release highlights include

  • Ability to specify overwrite and commitWithin as request parameters when using the JSON update format
  • TermQParserPlugin, useful when generating filter queries from terms returned from field faceting or the terms component.
  • DebugComponent now supports using a NamedList to model Explanation objects in its responses instead of Explanation.toString
  • Improvements to the UIMA and Carrot2 integrations

I had personally been looking forward to the overwrite request param addition to JSON update format, so I'm delighted about this release.

Great work guys!

Classical learning curves for some editors

Posted by Kelvin on 20 Jun 2011 | Tagged as: programming

PHP function to send an email with file attachment

Posted by Kelvin on 11 Jun 2011 | Tagged as: PHP, programming

Courtesy of http://www.finalwebsites.com/forums/topic/php-e-mail-attachment-script

function mail_attachment($filename, $path, $mailto, $from_mail, $from_name, $replyto, $subject, $message) {
    $file = $path.$filename;
    $file_size = filesize($file);
    $handle = fopen($file, "r");
    $content = fread($handle, $file_size);
    fclose($handle);
    $content = chunk_split(base64_encode($content));
    $uid = md5(uniqid(time()));
    $name = basename($file);
    $header = "From: ".$from_name." <".$from_mail.">\r\n";
    $header .= "Reply-To: ".$replyto."\r\n";
    $header .= "MIME-Version: 1.0\r\n";
    $header .= "Content-Type: multipart/mixed; boundary=\"".$uid."\"\r\n\r\n";
    $header .= "This is a multi-part message in MIME format.\r\n";
    $header .= "--".$uid."\r\n";
    $header .= "Content-type:text/plain; charset=iso-8859-1\r\n";
    $header .= "Content-Transfer-Encoding: 7bit\r\n\r\n";
    $header .= $message."\r\n\r\n";
    $header .= "--".$uid."\r\n";
    $header .= "Content-Type: application/octet-stream; name=\"".$filename."\"\r\n"; // use different content types here
    $header .= "Content-Transfer-Encoding: base64\r\n";
    $header .= "Content-Disposition: attachment; filename=\"".$filename."\"\r\n\r\n";
    $header .= $content."\r\n\r\n";
    $header .= "--".$uid."--";
    if (mail($mailto, $subject, "", $header)) {
        echo "mail send ... OK"; // or use booleans here
    } else {
        echo "mail send ... ERROR!";
    }
}
 

How to revert a svn commit

Posted by Kelvin on 23 May 2011 | Tagged as: programming

I recently had to revert a svn commit of a developer who was absolutely CLUELESS about how subversion works and ended up undoing a bunch of my changes. ARGH!

I decided to roll back ALL her changes and let her reapply the commits. Here's how to do it:

svn merge -r [current revision]:[last good revision] .
 

For example:

svn merge -r 90:88 .
svn commit -m "Undoing a clueless commit"
 

Recursively find the n latest modified files in a directory

Posted by Kelvin on 18 May 2011 | Tagged as: programming, Ubuntu

This entry is part 13 of 19 in the Bash-whacking series

Here's how to find the latest modified files in a directory. Particularly useful when you've made some changes and can't remember what!

find . -type f -printf '%T@ %p\n' | sort -n | tail -1 | cut -f2- -d" "
 

Replace tail -1 with tail -20 to list the 20 most recent files, for example.

Courtesy of StackOverflow: http://stackoverflow.com/questions/4561895/how-to-recursively-find-the-latest-modified-file-in-a-directory
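
The -printf option is a GNU find extension, so the one-liner may not work with other find implementations. As a hedged alternative, here is a minimal Python 3 sketch of the same idea: walk the tree, collect modification times, sort, and keep the last n. The function name latest_files is an assumption made for this example.

```python
import os

def latest_files(root=".", n=1):
    """Return the n most recently modified regular files under root, oldest first."""
    if n <= 0:
        return []
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path):  # skip broken symlinks and the like
                found.append((os.path.getmtime(path), path))
    found.sort()  # ascending by mtime, like `sort -n`
    return [path for _mtime, path in found[-n:]]  # like `tail -n`

if __name__ == "__main__":
    for path in latest_files(".", 20):
        print(path)
```

As with the shell version, the newest file is printed last.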

Convert fixed-width file to CSV

Posted by Kelvin on 12 May 2011 | Tagged as: programming, Ubuntu

This entry is part 12 of 19 in the Bash-whacking series

After trying various sed/awk recipes to convert from fixed-width to CSV, I found a Python script that works well.

Here it is, from http://code.activestate.com/recipes/452503-convert-db-fixed-width-output-to-csv-format/

## {{{ http://code.activestate.com/recipes/452503/ (r1)
# Ian Maurer
# http://itmaurer.com/
# Convert a Fixed Width file to a CSV with Headers
#
# Requires following format:
#
# header1      header2 header3
# ------------ ------- ----------------
# data_a1      data_a2 data_a3

def writerow(ofile, row):
    for i in range(len(row)):
        row[i] = '"' + row[i].replace('"', '') + '"'
    data = ",".join(row)
    ofile.write(data)
    ofile.write("\n")

def convert(ifile, ofile):
    header = ifile.readline().strip()
    while not header:
        header = ifile.readline().strip()

    hticks = ifile.readline().strip()
    csizes = [len(cticks) for cticks in hticks.split()]
   
    line = header
    while line:

        start, row = 0, []
        for csize in csizes:
            column = line[start:start+csize].strip()
            row.append(column)
            start = start + csize + 1

        writerow(ofile, row)
        line = ifile.readline().strip()

if __name__ == "__main__":
    import sys
    if len(sys.argv) == 3:
        ifile = open(sys.argv[1], "r")
        ofile = open(sys.argv[2], "w+")
        convert(ifile, ofile)
       
    else:
        print("Usage: python convert.py <input> <output>")
## end of http://code.activestate.com/recipes/452503/ }}}
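
To make the column-splitting step concrete, here is a small Python 3 sketch of the same idea applied to a single line: the width of each column is taken from the run of dashes under its header, and columns are assumed to be separated by exactly one space. The helper name split_fixed_width is hypothetical and not part of the recipe.

```python
def split_fixed_width(line, ticks):
    """Split one fixed-width line using the dashes row to find column widths."""
    widths = [len(t) for t in ticks.split()]
    start, row = 0, []
    for width in widths:
        row.append(line[start:start + width].strip())
        start += width + 1  # skip the single space between columns
    return row

# Dashes row and data row shaped like the format described in the recipe header.
ticks = "-" * 12 + " " + "-" * 7 + " " + "-" * 16
data = "data_a1".ljust(12) + " " + "data_a2".ljust(7) + " " + "data_a3"

print(split_fixed_width(data, ticks))  # ['data_a1', 'data_a2', 'data_a3']
```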
 

Application-wide keyboard shortcuts in Swing

Posted by Kelvin on 21 Apr 2011 | Tagged as: programming

In Swing's focus subsystem, keyboard events are fired only on the component that currently has focus.

One way of implementing application-wide keyboard shortcuts is to add them to _every_ component that is created. (Yes, it's as ridonkulous as it sounds.)

Here's another way, using KeyboardFocusManager:

  // Add Ctrl-W listener to quit application
    KeyboardFocusManager.getCurrentKeyboardFocusManager().addKeyEventDispatcher(new KeyEventDispatcher(){

      public boolean dispatchKeyEvent(KeyEvent e) {
        if (e.getKeyCode() == java.awt.event.KeyEvent.VK_W && e.getModifiers() == java.awt.event.InputEvent.CTRL_MASK) {
          System.exit(0);
          return true;
        }
        return false;
      }
    });
 
