Solr 3.2 released!
Posted by Kelvin on 22 Jun 2011 | Tagged as: crawling, Lucene / Solr / Elastic Search / Nutch, programming
I'm a little slow off the blocks here, but I just wanted to mention that Solr 3.2 has been released!
Get your download here: http://www.apache.org/dyn/closer.cgi/lucene/solr
Solr 3.2 release highlights include:
- Ability to specify overwrite and commitWithin as request parameters when using the JSON update format
- TermQParserPlugin, useful when generating filter queries from terms returned from field faceting or the terms component.
- DebugComponent now supports using a NamedList to model Explanation objects in its responses instead of Explanation.toString
- Improvements to the UIMA and Carrot2 integrations
I had personally been looking forward to the overwrite request param addition to JSON update format, so I'm delighted about this release.
Great work guys!
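For the curious, here is a minimal Python sketch of what passing overwrite and commitWithin as request parameters might look like with the JSON update format. The base URL, the /update/json handler path and the document fields are assumptions for illustration (check the handler configured in your own solrconfig.xml), and the function name build_solr_json_update is hypothetical. The sketch only builds the request; it does not send it.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request


def build_solr_json_update(base_url, doc, overwrite=True, commit_within_ms=None):
    """Build (but do not send) a JSON update request for a Solr instance.

    base_url and the /update/json path are assumptions; adjust them to
    match the update handler registered in your solrconfig.xml.
    """
    # overwrite and commitWithin travel in the query string, not in the JSON body
    params = {"overwrite": "true" if overwrite else "false"}
    if commit_within_ms is not None:
        params["commitWithin"] = str(commit_within_ms)
    url = base_url.rstrip("/") + "/update/json?" + urlencode(params)
    # The body uses the {"add": {"doc": ...}} form of the JSON update format
    body = json.dumps({"add": {"doc": doc}}).encode("utf-8")
    return Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    req = build_solr_json_update(
        "http://localhost:8983/solr",  # assumed local Solr
        {"id": "doc1", "title": "hello"},
        overwrite=False,
        commit_within_ms=5000,
    )
    print(req.full_url)
```

To actually send it, pass the returned Request to urllib.request.urlopen against a running Solr.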
Classical learning curves for some editors
Posted by Kelvin on 20 Jun 2011 | Tagged as: programming
PHP function to send an email with file attachment
Posted by Kelvin on 11 Jun 2011 | Tagged as: PHP, programming
Courtesy of http://www.finalwebsites.com/forums/topic/php-e-mail-attachment-script
function mail_attachment($filename, $path, $mailto, $from_mail, $from_name, $replyto, $subject, $message) {
    // Read the attachment and base64-encode it in mail-safe chunks
    $file = $path.$filename;
    $file_size = filesize($file);
    $handle = fopen($file, "rb");
    $content = fread($handle, $file_size);
    fclose($handle);
    $content = chunk_split(base64_encode($content));
    // Unique MIME boundary separating the text part from the attachment
    $uid = md5(uniqid(time()));
    $name = basename($file);
    $header = "From: ".$from_name." <".$from_mail.">\r\n";
    $header .= "Reply-To: ".$replyto."\r\n";
    $header .= "MIME-Version: 1.0\r\n";
    $header .= "Content-Type: multipart/mixed; boundary=\"".$uid."\"\r\n\r\n";
    $header .= "This is a multi-part message in MIME format.\r\n";
    // Part 1: the plain-text message
    $header .= "--".$uid."\r\n";
    $header .= "Content-type:text/plain; charset=iso-8859-1\r\n";
    $header .= "Content-Transfer-Encoding: 7bit\r\n\r\n";
    $header .= $message."\r\n\r\n";
    // Part 2: the attachment
    $header .= "--".$uid."\r\n";
    $header .= "Content-Type: application/octet-stream; name=\"".$filename."\"\r\n"; // use different content types here
    $header .= "Content-Transfer-Encoding: base64\r\n";
    $header .= "Content-Disposition: attachment; filename=\"".$filename."\"\r\n\r\n";
    $header .= $content."\r\n\r\n";
    $header .= "--".$uid."--";
    // The whole MIME body is carried in $header, so the message argument stays empty
    if (mail($mailto, $subject, "", $header)) {
        echo "mail send ... OK"; // or use booleans here
    } else {
        echo "mail send ... ERROR!";
    }
}
How to revert a svn commit
Posted by Kelvin on 23 May 2011 | Tagged as: programming
I recently had to revert a svn commit of a developer who was absolutely CLUELESS about how subversion works and ended up undoing a bunch of my changes. ARGH!
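I decided to roll back ALL her changes and let her reapply the commits. Here's how to do it with a reverse merge:
svn merge -r [current_revision]:[previous_revision] [working_copy_path]
for example (the revision numbers here are made up)
svn merge -r 101:100 .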
svn commit -m "Undoing a clueless commit"
Recursively find the n latest modified files in a directory
Posted by Kelvin on 18 May 2011 | Tagged as: programming, Ubuntu
Here's how to find the latest modified files in a directory. Particularly useful when you've made some changes and can't remember what!
find . -type f -printf '%T@ %p\n' | sort -n | tail -1 | cut -f2- -d" "
For example, replace tail -1 with tail -20 to list the 20 most recent files.
Courtesy of StackOverflow: http://stackoverflow.com/questions/4561895/how-to-recursively-find-the-latest-modified-file-in-a-directory
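If GNU find isn't available (the -printf option is a GNU extension), roughly the same thing can be sketched in Python. This is a minimal sketch, not a drop-in replacement: the function name latest_modified_files is made up for illustration, and it only considers the files that os.walk reports.

```python
import os


def latest_modified_files(root, n=1):
    """Return the n most recently modified files under root, oldest first."""
    if n <= 0:
        return []
    entries = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                entries.append((os.path.getmtime(path), path))
            except OSError:
                # File vanished or is unreadable (e.g. a broken symlink); skip it
                continue
    # Sort by modification time, like `sort -n`, then keep the tail
    entries.sort()
    return [path for _mtime, path in entries[-n:]]


if __name__ == "__main__":
    for path in latest_modified_files(".", 20):
        print(path)
```

As with the shell version, the most recently modified file comes last in the output.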
Convert fixed-width file to CSV
Posted by Kelvin on 12 May 2011 | Tagged as: programming, Ubuntu
After trying various sed/awk recipes to convert from fixed-width to CSV, I found a Python script that works well.
Here it is, from http://code.activestate.com/recipes/452503-convert-db-fixed-width-output-to-csv-format/
## {{{ http://code.activestate.com/recipes/452503/ (r1)
# Ian Maurer
# http://itmaurer.com/
# Convert a Fixed Width file to a CSV with Headers
#
# Requires following format:
#
# header1      header2 header3
# ------------ ------- ----------------
# data_a1      data_a2 data_a3
def writerow(ofile, row):
    # Quote every field, dropping any embedded double quotes
    for i in range(len(row)):
        row[i] = '"' + row[i].replace('"', '') + '"'
    data = ",".join(row)
    ofile.write(data)
    ofile.write("\n")

def convert(ifile, ofile):
    # Skip leading blank lines to find the header row
    header = ifile.readline().strip()
    while not header:
        header = ifile.readline().strip()
    # The dashed line under the header gives the width of each column
    hticks = ifile.readline().strip()
    csizes = [len(cticks) for cticks in hticks.split()]
    line = header
    # Conversion stops at the first blank line or at end of file
    while line:
        start, row = 0, []
        for csize in csizes:
            column = line[start:start+csize].strip()
            row.append(column)
            start = start + csize + 1  # +1 skips the separating space
        writerow(ofile, row)
        line = ifile.readline().strip()

if __name__ == "__main__":
    import sys
    if len(sys.argv) == 3:
        ifile = open(sys.argv[1], "r")
        ofile = open(sys.argv[2], "w+")
        convert(ifile, ofile)
    else:
        print("Usage: python convert.py <input> <output>")
## end of http://code.activestate.com/recipes/452503/ }}}
Application-wide keyboard shortcuts in Swing
Posted by Kelvin on 21 Apr 2011 | Tagged as: programming
In Swing, keyboard events are dispatched to whichever component currently has focus.
One way of implementing application-wide keyboard shortcuts is to add them to _every_ component that is created. (yes, it's as ridonkulous as it sounds)
Here's another way, using KeyboardFocusManager:
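import java.awt.KeyEventDispatcher;
import java.awt.KeyboardFocusManager;
import java.awt.event.InputEvent;
import java.awt.event.KeyEvent;

// Register once at startup; the dispatcher sees key events before the focused component does
KeyboardFocusManager.getCurrentKeyboardFocusManager().addKeyEventDispatcher(new KeyEventDispatcher() {
    public boolean dispatchKeyEvent(KeyEvent e) {
        // Ctrl+W quits the application
        if (e.getKeyCode() == KeyEvent.VK_W && e.getModifiers() == InputEvent.CTRL_MASK) {
            System.exit(0);
            return true; // event handled, stop dispatching
        }
        return false; // let normal dispatching continue
    }
});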
Working MySQL 5.1+ Levenshtein Stored Procedure
Posted by Kelvin on 13 Apr 2011 | Tagged as: programming
Update: Changed 0x00 to '\0' as per Jan-Hendrik's comment below.
There are a number of MySQL functions for calculating Levenshtein distance floating around StackOverflow and other forums. They all seem to be based off http://codejanitor.com/wp/2007/02/10/levenshtein-distance-as-a-mysql-stored-function/ (broken link).
Anyway, I couldn't get them to work for me. MySQL complained:
Well, it turns out that you need to specify a different statement delimiter instead of the default ; because the function body itself contains semicolons. So here's a working version of the Levenshtein distance function, courtesy of CodeJanitor.
DELIMITER //
CREATE FUNCTION LEVENSHTEIN (s1 VARCHAR(255), s2 VARCHAR(255))
RETURNS INT
DETERMINISTIC
BEGIN
    DECLARE s1_len, s2_len, i, j, c, c_temp, cost INT;
    DECLARE s1_char CHAR;
    -- cv0 and cv1 hold the current and previous rows of the distance matrix, one byte per cell
    DECLARE cv0, cv1 VARBINARY(256);
    SET s1_len = CHAR_LENGTH(s1), s2_len = CHAR_LENGTH(s2), cv1 = '\0', j = 1, i = 1, c = 0;
    IF s1 = s2 THEN
        RETURN 0;
    ELSEIF s1_len = 0 THEN
        RETURN s2_len;
    ELSEIF s2_len = 0 THEN
        RETURN s1_len;
    ELSE
        -- Initialise the previous row to 0, 1, 2, ..., s2_len
        WHILE j <= s2_len DO
            SET cv1 = CONCAT(cv1, UNHEX(HEX(j))), j = j + 1;
        END WHILE;
        WHILE i <= s1_len DO
            SET s1_char = SUBSTRING(s1, i, 1), c = i, cv0 = UNHEX(HEX(i)), j = 1;
            WHILE j <= s2_len DO
                SET c = c + 1;
                IF s1_char = SUBSTRING(s2, j, 1) THEN SET cost = 0; ELSE SET cost = 1; END IF;
                SET c_temp = CONV(HEX(SUBSTRING(cv1, j, 1)), 16, 10) + cost;
                IF c > c_temp THEN SET c = c_temp; END IF;
                SET c_temp = CONV(HEX(SUBSTRING(cv1, j+1, 1)), 16, 10) + 1;
                IF c > c_temp THEN SET c = c_temp; END IF;
                SET cv0 = CONCAT(cv0, UNHEX(HEX(c))), j = j + 1;
            END WHILE;
            SET cv1 = cv0, i = i + 1;
        END WHILE;
    END IF;
    RETURN c;
END//
DELIMITER ;
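For sanity-checking the stored function's results, the same two-row algorithm can be sketched in Python. This is my own minimal sketch rather than part of the CodeJanitor code, and it compares characters case-sensitively, whereas MySQL string comparison depends on the column collation, so results can differ for mixed-case input.

```python
def levenshtein(s1, s2):
    """Levenshtein distance using two rows of the distance matrix."""
    if s1 == s2:
        return 0
    if not s1:
        return len(s2)
    if not s2:
        return len(s1)
    # previous[j] is the distance between the first i-1 chars of s1
    # and the first j chars of s2 (the role cv1 plays in the SQL)
    previous = list(range(len(s2) + 1))
    for i, ch1 in enumerate(s1, start=1):
        current = [i]  # distance from s1[:i] to the empty string
        for j, ch2 in enumerate(s2, start=1):
            cost = 0 if ch1 == ch2 else 1
            current.append(min(
                current[j - 1] + 1,      # insertion
                previous[j] + 1,         # deletion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


if __name__ == "__main__":
    print(levenshtein("kitten", "sitting"))
```

Comparing a handful of pairs against SELECT LEVENSHTEIN('kitten', 'sitting') is a quick way to confirm the function was created correctly.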
Name parser links
Posted by Kelvin on 13 Apr 2011 | Tagged as: programming
I'm about to write some code to normalize names, e.g. split out firstName, middleName, lastName etc.
Here are some links on the topic:
http://search.cpan.org/dist/Lingua-EN-NameParse/lib/Lingua/EN/NameParse.pm
http://alphahelical.com/code/misc/nameparse/nameparse.php.txt
http://jasonpriem.com/human-name-parse/
http://code.google.com/p/php-name-parser/
http://www.onlineaspect.com/2009/08/17/splitting-names/
Preventing Java XML Parsers from resolving external DTDs
Posted by Kelvin on 07 Apr 2011 | Tagged as: programming
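With some SAX parsers (Xerces, for example) you can disable loading of external DTDs by turning off this feature:
reader.setFeature("http://apache.org/xml/features/nonvalidating/load-external-dtd", false);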
Not all parsers support this, however. Piccolo, for one, does not.
You can accomplish the same thing on any parser with an EntityResolver that returns an empty document for every external entity:
reader.setEntityResolver(new EntityResolver() {
    public InputSource resolveEntity(String publicId, String systemId) throws SAXException, IOException {
        return new InputSource(new StringReader(""));
    }
});

