<h2>Ichiran@home 2021: the ultimate guide</h2>
<p>Recently I’ve been contacted by several people who wanted to use my Japanese text segmenter <a href="https://github.com/tshatrov/ichiran/">Ichiran</a> in their own projects. This is not surprising since it’s vastly superior to <a href="https://taku910.github.io/mecab/">Mecab</a> and similar software, and is occasionally updated with new vocabulary unlike many other segmenters. Ichiran powers <a href="https://ichi.moe/">ichi.moe</a> which is a very cool webapp that helped literally dozens of people learn Japanese.</p>
<p>A big obstacle to the adoption of Ichiran is the fact that it’s written in Common Lisp, and people who want to use it are often unfamiliar with this language. To fix this, I’m now providing a way to build Ichiran as a command line utility, which can then be called as a subprocess by scripts in other languages.</p>
<p>This is a master post on how to get Ichiran installed and how to use it, for people who don’t know any Common Lisp at all. I’m providing instructions for Linux (Ubuntu) and Windows; I haven’t tested other operating systems, but it should probably work on them too.</p>
<h3>PostgreSQL</h3>
<p>Ichiran uses a PostgreSQL database as a source for its vocabulary and other things. On Linux install <code>postgresql</code> using your preferred package manager. On Windows use <a href="https://www.enterprisedb.com/downloads/postgres-postgresql-downloads">the official installer</a>. You should remember the password for the <code>postgres</code> user, or create a new user if you know how to do it.</p>
<p>Download the <a href="https://github.com/tshatrov/ichiran/releases">latest release</a> of the Ichiran database. On the release page there are commands needed to restore the dump. On Windows they don’t really work; instead, create the database and restore the dump using pgAdmin (which is usually installed together with Postgres). Right-click on PostgreSQL/Databases/postgres and select “Query tool…”. Paste the following into the Query editor (replacing <code>[database_name]</code> with a name of your choice) and hit the Execute button.</p>
<pre><code>CREATE DATABASE [database_name]
WITH TEMPLATE = template0
OWNER = postgres
ENCODING = 'UTF8'
LC_COLLATE = 'Japanese_Japan.932'
LC_CTYPE = 'Japanese_Japan.932'
TABLESPACE = pg_default
CONNECTION LIMIT = -1;
</code></pre>
<p>Then refresh the Databases folder and you should see your new database. Right-click on it then select “Restore”, then choose the file that you downloaded (it wants “.backup” extension by default so choose “Format: All files” if you can’t find the file).</p>
<p>You might get a bunch of errors when restoring the dump saying that “user ichiran doesn’t exist”. Just ignore them.</p>
<h3>SBCL</h3>
<p>Ichiran uses <a href="http://sbcl.org/">SBCL</a> to run its Common Lisp code. You can download Windows binaries for SBCL 2.0.0 from the official site; on Linux you can use the package manager, or use binaries from the <a href="http://sbcl.org/platform-table.html">official site</a> as well, although they might be incompatible with your operating system.</p>
<p>However you really want the latest version 2.1.0, especially on Windows for uh… <a href="https://bugs.launchpad.net/sbcl/+bug/1907970">reasons</a>. There’s a <a href="https://bugs.launchpad.net/sbcl/+bug/1907970/comments/6">workaround</a> for Windows 10 though, so if you don’t mind turning on that option, you can stick with SBCL 2.0.0 really.</p>
<p>After installing <em>some</em> version of SBCL (SBCL requires SBCL to compile itself), download the source code of the latest version and let’s get to business.</p>
<p>On Linux it should be easy, just run</p>
<pre><code>sh make.sh --fancy
sudo sh install.sh
</code></pre>
<p>in the source directory.</p>
<p>On Windows it’s somewhat harder. Install <a href="https://www.msys2.org/">MSYS2</a>, then run “MSYS2 MinGW 64-bit”.</p>
<pre><code>pacman -S mingw-w64-x86_64-toolchain make
# for paths in MSYS2 replace drive prefix C:/ by /c/ and so on
cd [path_to_sbcl_source]
export PATH="$PATH:[directory_where_sbcl.exe_is_currently]"
# check that you can run sbcl from command line now
# type (sb-ext:quit) to quit sbcl
sh make.sh --fancy
unset SBCL_HOME
INSTALL_ROOT=/c/sbcl sh install.sh
</code></pre>
<p>Then edit Windows environment variables so that <code>PATH</code> contains <code>c:\sbcl\bin</code> and <code>SBCL_HOME</code> is <code>c:\sbcl\lib\sbcl</code> (replace <code>c:\sbcl</code> here and in <code>INSTALL_ROOT</code> with another directory if applicable). Check that you can run a normal Windows shell (<code>cmd</code>) and run <code>sbcl</code> from it.</p>
<h3>Quicklisp</h3>
<p>Quicklisp is a library manager for Common Lisp. You’ll need it to install the dependencies of Ichiran. Download <code>quicklisp.lisp</code> from the <a href="https://www.quicklisp.org/beta/">official site</a> and run the following command:</p>
<pre><code>sbcl --load /path/to/quicklisp.lisp
</code></pre>
<p>In SBCL shell execute the following commands:</p>
<pre><code>(quicklisp-quickstart:install)
(ql:add-to-init-file)
(sb-ext:quit)
</code></pre>
<p>This will ensure quicklisp is loaded every time SBCL starts.</p>
<h3>Ichiran</h3>
<p>Find the directory <code>~/quicklisp/local-projects</code> (<code>%USERPROFILE%\quicklisp\local-projects</code> on Windows) and <code>git clone</code> the <a href="https://github.com/tshatrov/ichiran/">Ichiran source code</a> into it. It is possible to place it into an arbitrary directory, but that requires <a href="https://common-lisp.net/project/asdf/asdf.html#Configuring-ASDF">configuring ASDF</a>, while <code>~/quicklisp/local-projects/</code> should work out of the box, as should <code>~/common-lisp/</code>, though I’m not sure about the Windows equivalent for the latter.</p>
<p>Ichiran won’t load without a <code>settings.lisp</code> file, which you might notice is absent from the repository. Instead, there’s a <code>settings.lisp.template</code> file. Copy <code>settings.lisp.template</code> to <code>settings.lisp</code> and edit the following values in <code>settings.lisp</code> (a sketch of a finished file follows the list below):</p>
<ul>
<li><code>*connection*</code> is the main database connection. It is a list of at least 4 elements: <em>database name</em>, <em>database user</em> (usually “postgres”), <em>database password</em> and <em>database host</em> (“localhost”). It can be followed by options like <code>:port 5434</code> if the database is running on a non-standard port.</li>
<li><code>*connections*</code> is an optional parameter for switching between several databases. You can probably ignore it.</li>
<li><code>*jmdict-data*</code> should be a path to <a href="https://gitlab.com/yamagoya/jmdictdb/-/tree/master/jmdictdb/data">these files</a> from the JMdict project. They contain descriptions of parts of speech etc.</li>
<li>Ignore all the other parameters; they’re only needed for creating the database from scratch.</li>
</ul>
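<p>Putting those together, here’s a minimal sketch of what the finished file might look like. This is an illustration based on the list above, not the literal template; the database name, password and path are placeholders, and the actual definitions in <code>settings.lisp.template</code> are the authoritative reference.</p>
<pre><code>;; hypothetical settings.lisp; adapt settings.lisp.template instead of copying this
(defparameter *connection* '("jmdict" "postgres" "yourpassword" "localhost"))
;; if Postgres runs on a non-standard port:
;; (defparameter *connection* '("jmdict" "postgres" "yourpassword" "localhost" :port 5434))
(defparameter *jmdict-data* #p"/path/to/jmdictdb/jmdictdb/data/")
</code></pre>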
<p>Run <code>sbcl</code>. You should now be able to load Ichiran with</p>
<pre><code>(ql:quickload :ichiran)
</code></pre>
<p>On the first run, run the following command. It should also be run after downloading a new database dump and updating Ichiran code, as it fixes various issues with the original JMdict data.</p>
<pre><code>(ichiran/mnt:add-errata)
</code></pre>
<p>Run the test suite with</p>
<pre><code>(ichiran/test:run-all-tests)
</code></pre>
<p>If not all tests pass, you did something wrong! If none of the tests pass, check that you configured the database connection correctly. If all tests pass, you have a working installation of Ichiran. Congratulations!</p>
<p>Some commands that can be used in Ichiran:</p>
<ul>
<li><code>(ichiran:romanize "一覧は最高だぞ" :with-info t)</code> this is basically a text-only equivalent of <a href="https://ichi.moe/">ichi.moe</a>, everyone’s favorite webapp based on Ichiran.</li>
<li><code>(ichiran/dict:simple-segment "一覧は最高だぞ")</code> returns a list of <code>WORD-INFO</code> objects which contain a lot of <a href="https://github.com/tshatrov/ichiran/blob/master/dict.lisp#L1224">interesting data</a> which is available through “accessor functions”. For example <code>(mapcar 'ichiran/dict:word-info-text (ichiran/dict:simple-segment "一覧は最高だぞ"))</code> will return a list of separate words in a sentence.</li>
<li><code>(ichiran/dict:dict-segment "一覧は最高だぞ" :limit 5)</code> like <code>simple-segment</code> but returns top 5 segmentations.</li>
<li><code>(ichiran/dict:word-info-from-text "一覧")</code> gets a <code>WORD-INFO</code> object for a specific word.</li>
<li><code>ichiran/dict:word-info-str</code> converts a <code>WORD-INFO</code> object to a human-readable string.</li>
<li><code>ichiran/dict:word-info-gloss-json</code> converts a <code>WORD-INFO</code> object into a “json” “object” containing dictionary information about a word. It’s not really JSON but an equivalent Lisp representation of it, which can be converted into a real JSON string with the <code>jsown:to-json</code> function. Putting it all together, the following code will convert the word <code>一覧</code> into a JSON string:</li>
</ul>
<pre><code>(jsown:to-json
 (ichiran/dict:word-info-gloss-json
  (ichiran/dict:word-info-from-text "一覧")))
</code></pre>
<p>Now, if you’re not familiar with Common Lisp all this stuff might seem confusing. Which is where <code>ichiran-cli</code> comes in, a brand new Command Line Interface to Ichiran.</p>
<h3>ichiran-cli</h3>
<p><code>ichiran-cli</code> is just a simple command-line application that can be called by scripts just like <code>mecab</code> and its ilk. The main difference is that it must be built by the user, who has already done the previous steps of the Ichiran installation process. It needs access to the postgres database, and the connection settings from <code>settings.lisp</code> are currently “baked in” during the build. It also contains a cache of some database references, so modifying the database (e.g. updating to a newer database dump) without also rebuilding <code>ichiran-cli</code> is highly inadvisable.</p>
<p>The build process is very easy. Just run <code>sbcl</code> and execute the following commands:</p>
<pre><code>(ql:quickload :ichiran/cli)
(ichiran/cli:build)
</code></pre>
<p>SBCL should exit at this point, and you’ll have a new <code>ichiran-cli</code> (<code>ichiran-cli.exe</code> on Windows) executable in the <code>ichiran</code> source directory. If SBCL didn’t exit, try deleting the old <code>ichiran-cli</code> and doing it again; it seems that on Linux SBCL sometimes can’t overwrite this file for some reason.</p>
<p>Use the <code>-h</code> option to show how to use this tool. There will be more options in the future, but at the time of this post it prints out the following:</p>
<pre><code>>ichiran-cli -h
Command line interface for Ichiran
Usage: ichiran-cli [-h|--help] [-e|--eval] [-i|--with-info] [-f|--full] [input]
Available options:
-h, --help print this help text
-e, --eval evaluate arbitrary expression and print the result
-i, --with-info print dictionary info
-f, --full full split info (as JSON)
By default calls ichiran:romanize, other options change this behavior
</code></pre>
<p>Here’s example usage of these switches (a subprocess sketch follows the list below):</p>
<ul>
<li><code>ichiran-cli "一覧は最高だぞ"</code> just prints out the romanization</li>
<li><code>ichiran-cli -i "一覧は最高だぞ"</code> - equivalent of <code>ichiran:romanize :with-info t</code> above</li>
<li><code>ichiran-cli -f "一覧は最高だぞ"</code> - outputs the full result of segmentation as JSON. This is the one you’ll probably want to use in scripts etc.</li>
<li><code>ichiran-cli -e "(+ 1 2 3)"</code> - execute arbitrary Common Lisp code… yup that’s right. Since this is a new feature, I don’t know yet which commands people really want, so this option can be used to execute any command such as those listed in the previous section.</li>
</ul>
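<p>Since the whole point is to call <code>ichiran-cli</code> as a subprocess, here’s a minimal sketch of doing that, somewhat ironically in Common Lisp (the same pattern applies to any language that can run a program and parse JSON). It assumes <code>ichiran-cli</code> is on the <code>PATH</code> and the jsown library is loaded; <code>segment-via-cli</code> is my name for illustration, not part of Ichiran.</p>
<pre><code>(defun segment-via-cli (text)
  "Run ichiran-cli -f on TEXT and parse the resulting JSON."
  (jsown:parse
   (uiop:run-program (list "ichiran-cli" "-f" text)
                     :output :string)))
</code></pre>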
<p>By the way, as I mentioned before, on Windows SBCL prior to 2.1.0 doesn’t parse non-ASCII command line arguments correctly, which is why I had to include a section about building a newer version of SBCL. However, if you use Windows 10, there’s a workaround that avoids having to build SBCL 2.1.0. Open “Language Settings”, find a link to “Administrative language settings”, click on “Change system locale…”, and turn on “Beta: Use Unicode UTF-8 for worldwide language support”. Then reboot your computer. Voila, everything will work now. At least in regards to SBCL; I can’t guarantee that other command line apps which use locales will work after that.</p>
<p>That’s it for now, hope you enjoy playing around with Ichiran in this new year. よろしくおねがいします!</p>
<h2>Previewing images in and out of SLIME REPL</h2>
<p>As any Common Lisp coder knows, a REPL is an incredibly useful tool. It can be used not just for development, but for running all sorts of tasks. Personally, I don’t bother making my Lisp tools into executable scripts and just run them directly from SLIME. As such, any operation that requires leaving the REPL is quite inconvenient. For me, one such operation was viewing image files, for example in conjunction with my <a href="https://readevalprint.tumblr.com/post/163569150438/your-personal-diy-image-search"><code>match-client:match</code></a> tool. So lately I’ve been researching various methods to incorporate this functionality into the normal REPL workflow. Below, I present 3 methods that can be used to achieve this.</p>
<h3>Open in external program</h3>
<p>This one’s easy. When you want to view a file, launch an external process with your favorite image viewer. On Windows, a shell command consisting of just the image filename will launch the associated application; on Linux, it’s necessary to provide the name of the image viewer.</p>
<pre><code>(defvar *image-app* nil) ;; set it to '("eog") or something

(defun view-file-native (file)
  (let ((ns (uiop:native-namestring file)))
    (uiop:launch-program (if *image-app*
                             (append *image-app* (list ns))
                             (uiop:escape-shell-token ns)))))
</code></pre>
<p>Note that <code>uiop:launch-program</code> is used instead of <code>uiop:run-program</code>. The difference is that <code>launch-</code> is non-blocking: you can continue to work in your REPL while the image is displayed, whereas <code>run-</code> will not return until you close the image viewer.</p>
<p>Also note that when the first argument to run/launch-program is a string, it is not escaped, so I have to do that manually. And if the first argument is a list, it <em>must</em> be a program name followed by its arguments, so merely using <code>(list ns)</code> wouldn’t work on Windows.</p>
<h3>Inline image in REPL</h3>
<p>The disadvantage of the previous method is that the external program might steal focus, appear on top of your REPL and disrupt your workflow. And it’s well known that Emacs can do everything, including viewing images, so why not use that?</p>
<p>In fact, SLIME has a plugin specifically for displaying images in the REPL, <a href="https://github.com/slime/slime/blob/master/contrib/slime-media.el">slime-media</a>. However, it’s difficult to find any information on how to use it. Eventually I figured out that SWANK (SLIME’s CL backend) needs to send a <code>:write-image</code> event with appropriate arguments, and slime-media’s handler will display it right in the REPL. The easiest way is to just send the file path. The second argument is the resulting image’s string value: if you copy-paste (sorry, “kill-yank”) it in the REPL, it will act just as if you had typed this string.</p>
<pre><code>(swank::send-to-emacs '(:write-image "/path/to/test.png" "test"))
</code></pre>
<p><img src="https://64.media.tumblr.com/05589051ee3cafb70c4144dd9e780236/da84ae7df29eeb1b-98/s540x810/0ab0c5f8c886fea15e09783dce0ba350207dd7ec.png" alt=""/></p>
<p>You can even send raw image data using this method. I don’t have anything on hand to generate raw image data so here’s some code that reads from a file, converts it to a base64 string and sends it over SWANK.</p>
<pre><code>(with-open-file (in "/path/to/test.png" :direction :input :element-type '(unsigned-byte 8))
  (let* ((arr (make-array (file-length in) :element-type '(unsigned-byte 8)))
         (b64 (progn (read-sequence arr in)
                     (cl-base64:usb8-array-to-base64-string arr))))
    (swank::send-to-emacs `(:write-image ((:data ,b64 :type swank-io-package::png)) "12345"))))
</code></pre>
<p>Note that the first argument to <code>:write-image</code> must be a list with a single element, which is itself a plist containing <code>:data</code> and <code>:type</code> keys. <code>:data</code> must be a base64-encoded raw image data. <code>:type</code> must be a symbol in <code>swank-io-package</code>. It’s not exactly convenient, so if you’re going to use this functionality a helper function/macro might be necessary.</p>
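<p>For the simpler file-path form of the event, such a helper can be tiny. Here’s a minimal sketch; <code>show-image-inline</code> is my own name for it, not something provided by SLIME or slime-media.</p>
<pre><code>(defun show-image-inline (file &optional (label (file-namestring file)))
  "Display FILE inline in the SLIME REPL via the slime-media contrib."
  (swank::send-to-emacs
   `(:write-image ,(uiop:native-namestring file) ,label)))
</code></pre>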
<h3>Image in a SLIME popup buffer</h3>
<p>Inline images are not always convenient. They can’t be resized, and will take up as much space as is necessary to display them. Meanwhile, Emacs itself has a built-in image viewer (<code>image-mode</code>) which can fit images to the width or height of a buffer. And SLIME has a concept of a “popup buffer” which is, for example, used by the macroexpander (<code>C-c C-m</code>) to display the result of a macro expansion in a separate window.</p>
<p>Interestingly, <code>slime-media.el</code> defines a <code>:popup-buffer</code> event, but it seems impossible to trigger it from SWANK. It is, however, a useful code reference for how to create the popup buffer in ELisp. This time we won’t bother with “events” and will just straight up execute some ELisp code using <code>swank::eval-in-emacs</code>. By default this feature is disabled on the Emacs side, so you’ll have to set the Emacs variable <code>slime-enable-evaluate-in-emacs</code> to <code>t</code> for this method to work.</p>
<p>Also Emacs must be compiled with ImageMagick for the resizing functionality to work.</p>
<p>Anyway, the code to view file in the popup buffer looks like this:</p>
<pre><code>(defun view-file-slime (file &key (bufname "*image-viewer*"))
  (let ((ns (namestring file)))
    (swank::eval-in-emacs
     `(progn
        (slime-with-popup-buffer (,bufname :connection t :package t)
          (insert-image (create-image ,ns))
          (image-mode)
          (setf buffer-file-name ,ns)
          (not-modified)
          (image-toggle-display-image))
        ;; try to resize the image after the buffer is displayed
        (with-current-buffer ,bufname (image-toggle-display-image))))))
</code></pre>
<p>Arriving at this solution required reading <a href="https://github.com/emacs-mirror/emacs/blob/master/lisp/image-mode.el">image-mode’s source code</a> to understand what exactly makes <code>image-mode</code> behave just as if the image file had been opened in Emacs via <code>C-x C-f</code>. First off, <code>image-mode</code> can be a major and a minor mode, and the minor mode is not nearly as useful. <code>slime-with-popup-buffer</code> has a <code>:mode</code> keyword argument, but it would cause <code>image-mode</code> to be set before the image is inserted, and it will be a minor mode in this case! Therefore <code>(image-mode)</code> must be called after <code>insert-image</code>.</p>
<p>Next, the buffer must satisfy <a href="https://github.com/emacs-mirror/emacs/blob/master/lisp/image-mode.el#L783-L799">several conditions</a> in order to get image data from the filename and not from the buffer itself. Technically it shouldn’t be necessary, but I couldn’t get auto resizing to work when <code>data-p</code> is true. So I set <code>buffer-file-name</code> to image’s filename and set <code>not-modified</code> flag on.</p>
<p>Next, <code>image-toggle-display-image</code> is called to possibly resize the image according to image-mode settings. It’s called outside of <code>slime-with-popup-buffer</code> for the following reason: at that point the buffer might not yet be visible or have any specific dimensions assigned to it, in which case resizing would do nothing.</p>
<p>Here’s an example of how calling this function looks in Emacs.</p>
<p><img src="https://64.media.tumblr.com/6b1203dd3f9ca024da87410f198109f5/da84ae7df29eeb1b-ba/s540x810/2b96e4c02aed99c2450a43be59fa1d31a6f34483.png" alt=""/></p>
<p>The position of the popup buffer depends on whether the original Emacs window is wide enough or not. I think it looks better when it’s divided vertically. Use <code>M-x image-transform-fit-to-height</code> or <code>M-x image-transform-fit-to-width</code> to set up the auto-resizing method (it gets remembered for future images). Unfortunately there’s no way to fit both height and width, at least with vanilla Emacs. I prefer fit-to-width because in case the image is too tall, it is possible to scroll the image vertically with <code>M-PgDn</code> and <code>M-PgUp</code> from the other buffer. Unlike other <code>image-mode</code> buffers, this buffer supports a shortcut <code>q</code> to close itself, as well as various SLIME shortcuts, for example <code>C-c C-z</code> to return to the REPL.</p>
<p>That’s it for now, hope you enjoyed this overview, and if you happen to know a better way to display images in Emacs, I would be interested to hear about it.</p>
<h2>Multi-threaded testing with lisp-unit and lparallel</h2>
<p>There’s no shortage of testing libraries in the Common Lisp world. In fact, the first library a Lisp developer writes is usually a unit testing library. Everyone has their favorite as well.</p>
<p>For my project <a href="https://github.com/tshatrov/ichiran">Ichiran</a>, I had to start testing early on, mainly to keep track of regressions in the segmentation algorithm. One of the main features of this project is the ability to split a sentence of Japanese text into separate words (there are no spaces), which it does better than other known algorithms.</p>
<p>For the testing framework, I chose <a href="https://github.com/OdonataResearchLLC/lisp-unit">lisp-unit</a>. I think it’s considered somewhat old-fashioned these days, and it certainly doesn’t have a lot of fancy features, but its dead simplicity is also an advantage.</p>
<p>From the start, the testing procedure was simple. I had an <code>assert-segment</code> macro which basically compared two lists of strings, and a test <code>segmentation-test</code> which had a lot of <code>assert-segment</code>s in it.</p>
<pre><code>(defmacro assert-segment (str &rest segmentation)
  `(assert-equal ',segmentation (simple-segment ,str)))

(define-test segmentation-test
  (assert-segment "ご注文はうさぎですか" "ご注文" "は" "うさぎ" "です" "か")
  ...
  )
</code></pre>
<p>At some point I had enough of these test segmentations that the test took a while to finish. After using <a href="https://lparallel.org/">lparallel</a> in <a href="https://readevalprint.tumblr.com/post/163569150438/your-personal-diy-image-search">another project</a>, I got the idea that segmentation tests could be parallelized in a similar way, without changing the testing framework. Eventually I had an implementation that was “decent enough”; however, it had some flaws that I just kept ignoring. Recently I went about fixing these flaws, and so I present the proper way to have tests running in parallel in lisp-unit. A similar approach might also work in other test frameworks.</p>
<p>Let’s start with the <code>assert-segment</code> macro above. It can be divided into two parts. The first part is <code>(assert-equal expected-result actual-result)</code> which is fast, and has to run in the main thread (because it modifies lisp-unit’s special variables and can’t be assumed to be thread-safe). The second part is <code>(simple-segment str)</code> which produces <code>actual-result</code>, is relatively slow, and we want to run it in the thread pool. Now we have the following criteria:</p>
<ol>
<li><code>simple-segment</code> must be computed before it can be decided whether the test passes.</li>
<li><code>assert-equal</code> must be performed in the main thread.</li>
<li><code>assert-segment</code> must be placed sequentially in the body of <code>segmentation-test</code>.</li>
</ol>
<p>At first it might seem impossible to satisfy these criteria while also having <code>simple-segment</code> execute in parallel. After all, each <code>assert-segment</code> expands to <code>assert-equal</code> and that can’t evaluate until <code>simple-segment</code> returns the result of segmentation.</p>
<p>To resolve this apparent contradiction, we will use lparallel’s concept of <a href="https://lparallel.org/promises/">“promises”</a>. A <em>promise</em> is an object that indicates a result of some computation, which might have not finished, or not even started. To actually get the result from promise, we must <em>force</em> it, which will wait for the computation to finish. It’s an extremely powerful concept, though perhaps not very intuitive.</p>
<p>We will use two types of promises: a <em>future</em> starts running some computation in lparallel’s thread pool as soon as it’s created, while a <em>delay</em> starts running only when it is <em>forced</em>, which is basically identical to creating a closure and funcalling it later.</p>
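<p>To make the distinction concrete, here’s a minimal sketch (it assumes an lparallel kernel is already set up, as in the test runner at the end of this post):</p>
<pre><code>(let ((f (lparallel:future (expt 2 1000)))  ; starts computing in the thread pool immediately
      (d (lparallel:delay  (expt 2 1000)))) ; does nothing yet
  (list (lparallel:force f)   ; waits for the already-running computation
        (lparallel:force d))) ; only now runs the delayed computation
</code></pre>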
<p>Here’s how our test will work. Each <code>assert-segment</code> will create a <code>simple-segment</code> <em>future</em> which would start executing in the thread pool, and <code>assert-equal</code> <em>delay</em> which would <em>force</em> this future. The delays will only be forced at the end of the test, at which point all the segmentations are already scheduled to be executed. Here’s the outline of the test:</p>
<pre><code>(defvar *delays*)

(defmacro define-parallel-test (name &body body)
  `(define-test ,name
     (let ((*delays* nil))
       ,@body
       (map nil 'lparallel:force (reverse *delays*)))))

(define-parallel-test segmentation-test
  (assert-segment ...)
  (assert-segment ...)
  (assert-segment ...)
  ...
  )
</code></pre>
<p>Each <code>assert-segment</code> will have to <code>push</code> its delay into a special variable <code>*delays*</code>, and then they will be waited for in the same order they were pushed (which requires reversing the list of delays).</p>
<p>Before I define <code>assert-segment</code>, I would like to abstract the concept of it. Let’s create a generic macro <code>test-job</code> that would define one of the jobs that need to be executed during a test. Each <em>test-job</em> will have a <em>tester</em> (the part that executes in the main thread) and a <em>worker</em> (the part that executes in the thread pool).</p>
<pre><code>(defun test-progress (result &optional err)
  (princ (cond
           (err "E")
           (result ".")
           (t "F"))))

(defmacro test-job ((result-var) tester &body worker)
  (alexandria:with-gensyms (future error)
    `(push
      (let ((,future (lparallel:future (ignore-errors ,@worker))))
        (lparallel:delay
         (multiple-value-bind (,result-var ,error) (lparallel:force ,future)
           (cond (,error
                  (test-progress nil t)
                  (error ,error))
                 (t (test-progress ,tester))))))
      *delays*)))
</code></pre>
<p>The code is a bit nasty. First off, to keep track of completed tests I added a <code>test-progress</code> function. It will print a small dot for a successful test and a big letter if something went wrong.</p>
<p>The future is created with <code>lparallel:future</code>. You might notice that the worker code is wrapped with <code>ignore-errors</code>. However, I’m not really ignoring errors: the condition object is returned as the second value of <code>ignore-errors</code>, and <code>lparallel:future</code> thankfully preserves multiple values. In fact, this error is immediately re-raised after printing <code>test-progress</code>. Remember that the code within <code>lparallel:future</code> is executed in a thread pool, which means the error handlers set up by lisp-unit in the main thread are not handling any errors in that code! Which means if the test code is producing an error, you get a faceful of debugger windows in SLIME, one for each test case, which are quite hard to get rid of! <code>ignore-errors</code> solves this problem by transferring the error object into the main thread, where it is gracefully handled (and accounted for) by lisp-unit.</p>
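<p>In isolation, the “second value of <code>ignore-errors</code>” trick looks like this:</p>
<pre><code>(multiple-value-bind (result err) (ignore-errors (/ 1 0))
  (list result (type-of err)))
;; => (NIL DIVISION-BY-ZERO)
</code></pre>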
<p>Note that it is critical that <code>lparallel:future</code> is created outside of <code>lparallel:delay</code>. Otherwise the computation won’t start until the delay is forced, which will defeat the whole point.</p>
<p>Finally, <code>assert-segment</code> will look like this:</p>
<pre><code>(defun assert-segment (str &rest segmentation)
  (test-job (result) (assert-equal segmentation result)
    (simple-segment str)))
</code></pre>
<p>Oh right, one last thing. Now that this test uses an lparallel thread pool, it cannot be run with a mere <code>lisp-unit:run-tests</code>! A test runner must create lparallel’s <em>kernel</em> and also gracefully shut it down. Here’s an example:</p>
<pre><code>(defparameter *test-thread-count* 4)

(defun run-parallel-tests (&optional (tests :all) (pkg :ichiran/test))
  (let* ((lparallel:*kernel* (lparallel:make-kernel *test-thread-count*))
         (res (unwind-protect
                   (run-tests tests pkg)
                (lparallel:end-kernel))))
    (print-failures res)
    (print-errors res)
    res))
</code></pre>
<p>I hope this post has shed some light on using lparallel’s promise abstractions to execute parallel tasks. I think it’s the most natural interface for such a thing, and as you can see it makes it possible to transform seemingly sequential code into what is really multithreaded code. In this case it allowed me to trick a single-threaded library like lisp-unit into running several jobs in parallel. And that’s it for today’s post, see you next time!</p>
<h2>How caches work in Ichiran</h2>
<p>Welcome to another update of the very frequently updated (not!) web blog READ EVAL PRINT.</p>
<p>Recently I had some free time to work on my <a href="https://github.com/tshatrov/ichiran">Ichiran</a> / <a href="http://ichi.moe">ichi.moe</a> project, which I had written about in <a href="https://readevalprint.tumblr.com/post/101841449553/its-alive-the-path-from-library-to-web-app">previous posts</a> here. A new feature has been developed to load various extra data into the dictionary, and as a proof of concept I added all city/town/village names in Japan to it.</p>
<p>Also I’ve taken a stab at optimizing some frequently running code. The single biggest (both in run-time and code size) function <code>calc-score</code> has been optimized to <code>(speed 3)</code> with all optimization suggestions taken care of.</p>
<p>The profiler has also identified a function named <code>get-conj-data</code> as another time-waster. This function is one of many called by the abovementioned <code>calc-score</code> to calculate the score of a word. The reason it’s so wasteful is that it is called for every single word (often twice), and it performs a database query on every single call.</p>
<p>After I fixed the being-called-twice thing, the other possible optimization was clearly to avoid querying the database so much. As the name implies, <code>get-conj-data</code> gets conjugation metadata of a word. For example, the word <em>starts</em> in English is a plural form of the noun <em>start</em>, while also being a third-person form of the verb <em>to start</em>. In Japanese, there’s a lot more verb conjugations than in English, so most words in Ichiran database are actually verb conjugations.</p>
<p>Still, in a given sentence there will only be a few conjugatable words and yet <code>get-conj-data</code> is called on every word. If only we could know in advance that a given word doesn’t have any conjugations… And this is where the so-called “caching” comes in.</p>
<p>The Ichiran database is mostly read-only, so there are a lot of opportunities to reduce database queries, though these opportunities often go unused. I have created a simple caching interface located in the file <a href="https://github.com/tshatrov/ichiran/blob/master/conn.lisp">conn.lisp</a>, which is full of esoteric Lisp goodness such as <a href="https://readevalprint.tumblr.com/post/145348323128/all-you-need-is-progv">a macro involving PROGV</a>. Let’s look at the implementation:</p>
<pre><code>(defclass cache ()
  ((mapping :initform nil :allocation :class)
   (name :initarg :name :reader cache-name)
   (var :initarg :var :reader cache-var)
   (lock :reader cache-lock)))
</code></pre>
<p>Each cache will be an instance of this class. However, the data is not held in a slot, but rather in a special variable referred to by the <code>var</code> slot. There’s also a <code>lock</code> slot because I don’t want one cache loading the same data in two threads at once. The slot <code>mapping</code> is especially interesting: it has <code>:allocation :class</code>, so all caches share the same “mapping”. This mapping associates the <code>name</code> of a cache (which is a keyword) with an actual instance of <code>cache</code>, which makes it possible to refer to a cache instance from anywhere in the code, knowing only its name.</p>
<pre><code>(defmethod initialize-instance :after ((cache cache) &key &allow-other-keys)
  (let* ((name (cache-name cache))
         (old-cache (getf (slot-value cache 'mapping) name)))
    (setf (slot-value cache 'lock)
          (if old-cache
              (cache-lock old-cache)
              (sb-thread:make-mutex :name (symbol-name name))))
    (setf (getf (slot-value cache 'mapping) name) cache)))
</code></pre>
<p>You can see there’s some code dealing with an “old cache”. When a cache definition (using <code>defcache</code> below) is recompiled, <code>make-instance</code> will be used to create a new cache object. To avoid having to repopulate the cache, the data is not stored in the cache instance, but in a separate special variable. The cache lock is also transferred from the old cache to prevent any weirdness if the new cache is initialized while the old cache is still ingesting data.</p>
<pre><code>(defun all-caches ()
  (slot-value (sb-mop:class-prototype (find-class 'cache)) 'mapping))

(defun get-cache (name)
  (getf (all-caches) name))
</code></pre>
<p>To get a cache instance by name, we need to access the <code>mapping</code> slot. But is it possible to access it without having any particular cache instance at hand? It turns out it is, but only by using extra MOP functionality. You can use <code>closer-mop:class-prototype</code> for portable code.</p>
<pre><code>(defmacro defcache (name var &body init-body)
  (alexandria:with-gensyms (cache-var)
    `(progn
       (def-conn-var ,var nil)
       (make-instance 'cache :name ',name :var ',var)
       (defmethod init-cache ((,cache-var (eql ,name)))
         (with-connection *connection*
           ,@init-body)))))
</code></pre>
<p>Here’s what <code>defcache</code> looks like. <code>def-conn-var</code> is explained in the <a href="https://readevalprint.tumblr.com/post/145348323128/all-you-need-is-progv">PROGV post</a>; it’s basically a special variable that gets swapped depending on the current connection. <code>make-instance</code> is called at top level here; the only reason the instance isn’t garbage collected is that the class-allocated <code>mapping</code> slot refers to it. <code>init-cache</code> is a custom method that loads all data into the cache. It’s specialized on the name keyword, which is somewhat unusual.</p>
<pre><code>(defgeneric reset-cache (cache-name)
  (:method (cache-name)
    (let ((val (init-cache cache-name))
          (cache (get-cache cache-name)))
      (sb-thread:with-mutex ((cache-lock cache))
        (setf (symbol-value (cache-var cache)) val)))))

(defgeneric ensure (cache-name)
  (:method (cache-name)
    (let ((cache (get-cache cache-name)))
      (or (symbol-value (cache-var cache))
          (sb-thread:with-mutex ((cache-lock cache))
            (or (symbol-value (cache-var cache))
                (let ((val (init-cache cache-name)))
                  (setf (symbol-value (cache-var cache)) val))))))))

(defun init-all-caches (&optional reset)
  (loop with fn = (if reset 'reset-cache 'ensure)
        for (name . rest) on (all-caches) by #'cddr
        do (funcall fn name)))
</code></pre>
<p>The rest of the implementation is fairly straightforward. <code>ensure</code> is the function that will be used as an accessor to a cache, <code>init-all-caches</code> is called at startup, and <code>reset-cache</code> is mostly meant to be used interactively when I want to refresh a specific cache.</p>
<p>Note that <code>ensure</code> checks that the cache is non-NIL again after acquiring the lock: another thread could’ve filled the cache while it was still waiting for the lock.</p>
<p>Note also that <code>init-all-caches</code> uses a single quote to refer to the functions <code>'reset-cache</code> and <code>'ensure</code>, but uses <code>#'cddr</code> in the loop. As a rule, I almost always use a symbol as a function designator. The problem with <code>#'</code> is that it captures the function object at evaluation time, and not when it’s actually called, so in a project like mine with a lot of global state, it might result in outdated function objects hanging around when I redefine some functions on the fly. However, using a single-quote function designator in the <code>by</code> clause of <code>loop</code> leads to the following warning in SBCL:</p>
<pre><code>WARNING: Use of QUOTE around stepping function in LOOP will be left verbatim.
</code></pre>
<p>Why is this a warning, I have no idea. The code works, and my use of <code>quote</code> is totally intentional. So I’m using <code>#'</code> here just to avoid this warning. But since built-in functions like <code>cddr</code> can’t be redefined anyway, it doesn’t really matter.</p>
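<p>The difference between the two kinds of function designators is easy to demonstrate in a fresh REPL:</p>
<pre><code>(defun greet () "hello")

(defvar *by-symbol* 'greet)
(defvar *by-function* #'greet)

(defun greet () "bonjour") ; redefine on the fly

(funcall *by-symbol*)   ; => "bonjour" (the symbol is looked up at call time)
(funcall *by-function*) ; => "hello"   (the old function object was captured)
</code></pre>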
<p>EDIT: I have received response from one of SBCL developers, stassats, and apparently this warning is a relic from an old era and will be removed in the next version. Also apparently what I wrote about <code>#'</code> in the first version of this article was wrong, so now I rewrote my justification for not using it.</p>
<p>Now let’s look at how a specific cache can be defined and used:</p>
<pre><code>(defcache :no-conj-data *no-conj-data*
  (let ((no-conj-data (make-hash-table :size 200000)))
    (dolist (seq (query (:select (:distinct 'entry.seq) :from 'entry
                         :left-join (:as 'conjugation 'c) :on (:= 'entry.seq 'c.seq)
                         :where (:and (:is-null 'c.seq)))
                        :column))
      (setf (gethash seq no-conj-data) t))
    no-conj-data))

(defun no-conj-data (seq)
  (nth-value 1 (gethash seq (ensure :no-conj-data))))

(defun get-conj-data (seq &optional from/conj-ids texts)
  (when (no-conj-data seq)
    (return-from get-conj-data nil))
  ...
  )
</code></pre>
<p><code>defcache</code> is used to define a custom loading procedure. In this case, the data is a set of sequence numbers, implemented as keys of a hash table. Because the majority of the words in the database <em>are</em> conjugations, it’s more memory-efficient to only store the entries that aren’t.</p>
<p>With the cache in place, <code>get-conj-data</code> can exit early if the sequence number is known not to be a conjugation of anything, avoiding an extra database query.</p>
<p>In the end, between the <code>calc-score</code> optimization and the <code>get-conj-data</code> optimization, I achieved about a 2x speedup of the test suite, and it will likely apply to real-life performance as well.</p>
<p>Hopefully you enjoyed this writeup, and hopefully I’ll write more sometime in the future. Until next time!</p>
<h2>Your personal DIY image search</h2>
<p>Hi everyone, it’s been a while! I bet you forgot this blog even existed. I happen to be a big supporter of quality over quantity, so while my work on <a href="https://www.reddit.com/r/LearnJapanese/comments/679kah/ichimoe_japanese_text_parsingdictionary_tool/">parsing Japanese counters</a> earlier this year was pretty interesting, I already <a href="http://readevalprint.tumblr.com/post/96866606738/my-little-transliterator-cant-be-this-clos">wrote</a> <a href="http://readevalprint.tumblr.com/post/97467849358/who-needs-graph-theory-anyway">way</a> <a href="http://readevalprint.tumblr.com/post/99568614213/words-made-out-of-words">too</a> <a href="http://readevalprint.tumblr.com/post/101841449553/its-alive-the-path-from-library-to-web-app">many</a> articles about <a href="https://github.com/tshatrov/ichiran">Ichiran</a>/<a href="http://ichi.moe/">ichi.moe</a>, so I decided to keep it to myself. Recently I’ve been working on a little side project, and now that it finally <em>works</em>, I think it deserves a full-fledged blog post.</p>
<p>For a bit of a nostalgia trip, let’s go back to the early 00s. Remember when <a href="https://www.tineye.com/">TinEye</a> first appeared? It was amazing. For the first time you could easily find where that one image you once saved from some random phpBB forum is really from. It didn’t matter if your image was resized, or slightly edited from the original, it still worked. That shit was magic, my friends. Of course these days nobody is impressed by this stuff. <a href="https://images.google.com/">Google Image Search</a> indexes pretty much anything that exists on the Internet and even uses neural networks to identify content of an image.</p>
<p>Back to the present day. I discovered I have an image hoarding problem. Over the years of using <em>the Intertubes</em>, I have accumulated a massive number of images on my hard drive. When I see an image I like, my first thought is “do I have this one saved already?”, because how could I possibly remember? At this point I need <em>my own</em> personal Google Image Search. And <strong>(spoiler alert)</strong> now I have one.</p>
<p>First of all, I needed an actual image matching technology. These days <em>the cloud</em> is all the rage, so I definitely wanted to have this thing running in the cloud (as opposed to my local PC) so that I could search my images from anywhere in the world. After a cursory search, my eyes fell on a thing called <a href="https://github.com/pavlovai/match">Pavlov Match</a> which runs from a Docker container, so it should be pretty easy to install. I installed <code>docker</code> and <code>docker-compose</code> on my VPS, then git-cloned Match and ran <code>make dev</code> according to the instructions. This will actually run an Elasticsearch instance on the same VPS, and apparently the damn thing eats memory for breakfast, at least with the default settings. I’m using a cheap 2GB RAM Linode, so memory is actually a very finite resource here, as I would find out later. The default settings will also <strong>completely expose your Match installation AND Elasticsearch to the world</strong>. But don’t worry, I figured this out so that you don’t have to. Let’s edit <code>docker-compose.yml</code> from the match repository as follows:</p>
<pre>
version: '2'
services:
  match:
    image: pavlov/match:latest
    ports:
      - 127.0.0.1:8888:8888
    command: ["/wait-for-it.sh", "-t", "60", "elasticsearch:9200", "--", "gunicorn", "-b", "0.0.0.0:8888", "-w", "4", "--preload", "server:app"]
    links:
      - elasticsearch
  elasticsearch:
    image: elasticsearch
    environment:
      - "ES_JAVA_OPTS=-Xms256m -Xmx256m"
      - bootstrap.mlockall=true
    expose:
      - "9200"
</pre>
<p>This will make the match server available only on the local network within the VPS on port 8888, and Elasticsearch available only to these two Docker containers. It will also restrict Elasticsearch RAM consumption to 512mb, and the <code>--preload</code> flag reduces the amount of memory the gunicorn workers consume.</p>
<p>To make the match server available from the outside, I recommend proxying it through nginx or some other proper web server. You can also add authentication or an IP whitelist in nginx, because the match server has no authentication features whatsoever, so anyone would be able to search/add/delete the data on it.</p>
<p>That was the backend part. No programming required here! But this is a Lisp blog, so the next step is writing a Lisp client that can communicate with this server. The first step is reading the match <a href="https://github.com/pavlovai/match#api">API documentation</a>. You might notice it’s a bit… idiosyncratic. I guess REST is out of fashion these days. Anyway, I started implementing a client using the trusty <a href="http://weitz.de/drakma/">drakma</a>, but I quickly hit a limitation: match expects all parameters to be sent encoded as form data, but drakma can only encode POST parameters as form data and not, say, DELETE parameters. Not to be foiled by a badly designed API, I tried <a href="https://github.com/fukamachi/dexador">dexador</a>, and while <code>dex:delete</code> does not encode parameters as form data, <code>dex:request</code> is flexible enough to do so. Each response (a JSON string) is parsed using <a href="https://github.com/madnificent/jsown">jsown</a>.</p>
<pre>
(defun parse-request (&rest args)
  (when *auth*
    (setf args `(,@args :basic-auth ,*auth*)))
  (multiple-value-bind (content return-code)
      (handler-bind ((dex:http-request-failed #'dex:ignore-and-continue))
        (apply 'dex:request args))
    (cond
      ((<= 400 return-code 499)
       (jsown:new-js
         ("status" "fail")
         ("error" content)
         ("code" return-code)))
      (t (let ((obj (jsown:parse content)))
           (jsown:extend-js obj ("code" return-code)))))))

(defun add-local (file &key path (metadata "{}"))
  "Add local image to Match server"
  (parse-request
   (api-url "/add")
   :method :post
   :content `(("image" . ,(pathname file))
              ("filepath" . ,(or path file))
              ("metadata" . ,metadata))))
</pre>
<p>With this basic client in place, I can add and delete individual images, but it would be incredibly cumbersome to manage thousands of images with it. I had to <a href="https://github.com/tshatrov/match-client/blob/master/cache.lisp">write some code</a> that would scan specified directories for images, track any changes and then add/update/delete information on the Match server as needed. I had already written <a href="https://github.com/tshatrov/cl-sync/">something like this</a> before, so this was pretty easy. Of course SBCL’s “<code>sb-posix:stat</code> doesn’t work on Unicode filenames” bug has reared its head again, but <a href="http://readevalprint.tumblr.com/post/80560377648/getting-file-stats-for-unicode-filenames-in-sbcl">I already knew the workaround</a>. This time I relied completely on UIOP for recursively walking directories (<code>uiop:subdirectories</code> and <code>uiop:directory-files</code> are your friends). Each image file is represented as a CLOS object and saved into a hash table, which is serialized to a file using CL-STORE. The object has a <code>status</code> attribute which can be <code>:new</code>, <code>:update</code>, <code>:delete</code>, <code>:ok</code> and so on. Based on the status, an action needs to be performed, such as uploading an image to the Match server (for <code>:new</code> and <code>:update</code>).</p>
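<p>To give a rough idea of the shape of this cache, here’s a simplified sketch; the class and function names are illustrative, not the exact definitions from match-client.</p>
<pre>
(defclass image-entry ()
  ((path :initarg :path :reader entry-path)
   (status :initarg :status :accessor entry-status))) ; :new :update :delete :ok ...

;; the whole hash table is serialized to disk with CL-STORE
(defun save-cache (cache file)
  (cl-store:store cache file))

(defun load-cache (file)
  (cl-store:restore file))
</pre>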
<p>Now, I could just send a bunch of requests one after another, but that would be a waste. Remember, we have 4 gunicorn workers running on our server! This clearly calls for a thread pool. I thought <a href="http://marijnhaverbeke.nl/pcall/">PCALL</a> would be perfect for this, but nope. It uses <code>sb-thread:interrupt-thread</code>, which is incredibly unsafe, and the result is that you basically can’t safely make HTTP requests from thread workers. Debugging this took way too much time. In the end, I implemented a thread pool based on <a href="https://lparallel.org/">lparallel</a> promises, which is kind of overkill for such a simple use case, but at least it worked.</p>
<pre>
(setf *cache* (update-cache))
(let ((lparallel:*kernel* (lparallel:make-kernel threads)))
  (unwind-protect
       (loop for value in (alexandria:hash-table-values *cache*)
             collect (worker value) into futures
             finally (map nil 'lparallel:force futures))
    (lparallel:end-kernel)))
(save-cache *cache*)
</pre>
<p>Note that you must be very careful when doing things that affect global state inside the threads. For example, the <code>:delete</code> action removes a key from the hash table <code>*cache*</code>. This is not guaranteed to be an atomic operation, so it’s necessary to grab a global lock when doing it.</p>
<pre>
(defvar *cache-lock* (bordeaux-threads:make-lock "match-cache-lock"))
...
(bordeaux-threads:with-lock-held (*cache-lock*)
  (remhash key *cache*))
</pre>
<p>Printing messages to the REPL from inside threads also requires a separate lock and <code>(force-output)</code>, otherwise it will look like a complete mess!</p>
<pre>
(defun format-msg (str &rest args)
  (bordeaux-threads:with-lock-held (*msg-lock*)
    (terpri)
    (apply 'format t str args)
    (force-output)))
</pre>
<p>Now that the required functionality is implemented, it’s time to test-upload a bunch of stuff… and get back a bunch of errors. It took some sleuthing to discover that the gunicorn workers of my Match server were routinely getting killed by the “OOM killer”. Basically, the server runs out of memory, and in desperation the system kills a process that it doesn’t like. Remember, I only have 2GB of memory there!</p>
<p>I figured out that it’s images with very large dimensions that are the most problematic in terms of memory usage. If I were to resize these images to some reasonable size, the matching should still work pretty well. In order to execute this plan, I thought I’d use some Lisp to ImageMagick interface. There’s in fact a pure Lisp solution called <a href="https://github.com/slyrus/opticl">OptiCL</a>, but would it really handle any image? Remind me to test that later! Anyway, back to ImageMagick. Neither <a href="http://www.nil.at/software/lisp-magick.html">lisp-magick</a> nor <a href="https://github.com/TBRSS/lisp-magick-wand">lisp-magick-wand</a> would work with the most recent ImageMagick version (it seems its API has changed a bit). However, the last one I tried, <a href="https://github.com/muyinliu/cl-graphicsmagick">cl-graphicsmagick</a>, which uses a fork of ImageMagick called GraphicsMagick, unexpectedly worked (at least on my Windows laptop; note that you need to install the Microsoft Visual C Redistributable 2008, otherwise the library won’t load with CFFI), so I went with that.</p>
<p>Using the very useful temporary-file functionality of UIOP (<code>uiop:with-temporary-file</code>), I resize each oversized image to reasonable dimensions and save it into a temporary file, which is then uploaded to the Match server. I also send the file’s original and resized dimensions as metadata. Thankfully, this completely eradicated the memory issue. There’s a minor problem where GraphicsMagick cannot handle Unicode pathnames on Windows, so in that case I copy the original image into a temporary file with an ASCII-only name.</p>
<pre>
(defun resize-image (input-path output-path
                     &key (max-width *max-dimension*) (max-height *max-dimension*)
                          (filter :%QuadraticFilter) (blur 1))
  (gm::with-magick-wand (wand)
    (handler-case (gm::%MagickReadImage wand input-path)
      ;; graphicsmagick cannot read Unicode filenames on Windows so attempt to load a copy
      (gm::magick-error ()
        (uiop:with-temporary-file (:pathname tmp :prefix "gm" :type (pathname-type input-path))
          (uiop:copy-file input-path tmp)
          (setf wand (gm::%NewMagickWand))
          (gm::%MagickReadImage wand (namestring tmp)))))
    (let ((w (gm::%MagickGetImageWidth wand))
          (h (gm::%MagickGetImageHeight wand))
          (res nil))
      (multiple-value-bind (fw fh) (gm::fit-width-height w h max-width max-height)
        (unless (and (= w fw) (= h fh))
          (gm::%MagickResizeImage wand fw fh filter blur)
          (gm::%MagickWriteImage wand output-path)
          (setf res output-path))
        (values res w h fw fh)))))
</pre>
<p>Later I tested this code on an Ubuntu machine with GraphicsMagick installed from the Apt repository, and SBCL crashed into ldb debugger mode straight away… Welp. The helpful folks of <code>#lisp</code> told me the problem is with the signal handlers established by the GraphicsMagick library; somehow they confuse SBCL. Based on that advice, I eventually succeeded in making this work. Uninstall the apt GraphicsMagick and grab the sources. Find the file called <code>magick.c</code> and replace the line</p>
<pre>
InitializeMagickSignalHandlers(); /* Signal handlers */
</pre>
<p>with</p>
<pre>
// InitializeMagickSignalHandlers(); /* Signal handlers */
</pre>
<p>(commenting it out). Then do <code>configure --enable-shared</code> (see readme for possible options), <code>make</code> and <code>sudo make install</code>. This will make it work when called from SBCL on Linux.</p>
<p>Anyways, the full code of MATCH-CLIENT can be found at <a href="https://github.com/tshatrov/match-client">my Github</a>. It’s not installable from quicklisp for obvious reasons; in fact, it’s a complete pain to install, as you might’ve already guessed, but if you wanna try it, you’re welcome. The main two commands are <code>update</code> and <code>match</code>. The first is called to upload all images in your <code>*root-dirs*</code> to the server and then to update them if anything changes. <code>match</code> is used to match any image on the Internet (passed as a URL string) or a local image (passed as a pathname object) against the server. It returns a list of jsown objects (basically alists) that contain a score (up to 100 for an exact match), a path (with a “local tag” which can be different per device) and metadata containing the original and resized dimensions.</p>
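<p>Usage from the REPL looks roughly like this (a hypothetical sketch, assuming the package is named <code>match-client</code> as in the repository):</p>
<pre>
(match-client:update)                       ; scan *root-dirs* and sync with the server
(match-client:match "https://example.com/some-image.jpg") ; match a remote image
(match-client:match #p"d:/foo/bar/baz.jpg")               ; match a local file
</pre>
<p>A returned list of matches looks like this:</p>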
<pre>
((:OBJ ("score" . 96.00956)
("filepath" . "[HOME] d:/foo/bar/baz.jpg")
("metadata" :OBJ ("rw" . 1218) ("rh" . 2048) ("w" . 3413) ("h" . 5736))))
</pre>
<p>Anyway, this was a fun (although often frustrating) thing to build, and it ended up being quite useful! Thanks for reading and see you next time.</p>
<h2>Displaced arrays are, by the way, super useful, because they allow to have an efficient array slice operation (see RUTILS slice for an example implementation)</h2>
<p>Well, I’ve been hearing conflicting information on whether they are actually <i>efficient</i>. Accessing an element of a displaced array introduces a new level of indirection as opposed to just passing :start and :end arguments along with the array. Because they aren’t used very often, the existing implementations might not be very efficient at handling them.</p>
<p>However, in the case of Ichiran, there is a certain part where I have to iterate through every substring of a string. Creating a new string for every substring seemed wasteful to me, no matter how you slice it (pun intended). So I went with the following method: create <i>one</i> displaced array and <i>adjust</i> it to the new boundaries on every iteration. This seems to be, if not time-, then at least memory-efficient, because only one new object is created instead of O(n²) of them.</p>
<p>(relevant source code is <a href="https://github.com/tshatrov/ichiran/blob/master/dict.lisp#L749">here</a>)</p>
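<p>Here’s a minimal sketch of the trick; this is my own illustration, not the actual Ichiran code (see the link above for that):</p>
<pre>
(defun map-substrings (fn str)
  "Call FN on every substring of STR, reusing one displaced array."
  (let* ((len (length str))
         (sub (make-array 0 :element-type (array-element-type str)
                            :adjustable t
                            :displaced-to str
                            :displaced-index-offset 0)))
    (loop for start from 0 below len
          do (loop for end from (1+ start) to len
                   do (funcall fn (adjust-array sub (- end start)
                                                :displaced-to str
                                                :displaced-index-offset start))))))
</pre>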
<h2>All you need is PROGV</h2>
<p><em>I have never seen PROGV in use</em> – <a href="http://xach.com/naggum/articles/[email protected]">Erik Naggum</a></p>
<p><em>Common Lisp is very, very old. Tagbody and progv, anyone?</em> – Hacker News user <a href="https://news.ycombinator.com/item?id=973682">pwnstigator</a></p>
<p>I haven’t written anything on this blog lately, mostly because of lack of time to work on side projects and consequently the lack of Lisp things to talk about. However recently I’ve been working on various improvements to my <a href="https://github.com/tshatrov/ichiran">Ichiran</a> project, and here’s the story of how I came to use the much maligned (or rather, extremely obscure) special operator PROGV for the first time.</p>
<p>Ichiran is basically a glorified Japanese dictionary (used as the backend for the web app <a href="http://ichi.moe/">ichi.moe</a>) and it heavily depends on a Postgres database that contains all the words, definitions and so on. The database is based on a dump of an open <a href="http://www.edrdg.org/jmdict/j_jmdict.html">JMdict</a> dictionary, which is constantly updated based on the users’ submissions.</p>
<p>Well, the last time I generated the database from this dump was almost a year ago, and I have wanted to update the definitions for a while. However, this tends to break the accuracy of my word-segmenting algorithm. For this reason I want to keep the old and the new database around at the same time and be able to run whatever code with either of the databases.</p>
<p>I’m using <a href="http://marijnhaverbeke.nl/postmodern/">Postmodern</a> to access the database, which has a useful macro named <code>with-connection</code>. If I have a special variable <code>*connection*</code> and consistently use <code>(with-connection *connection* ...)</code> in my database-accessing functions then I can later call</p>
<pre>
(let ((*connection* '("foo" "bar" "baz" "quux")))
  (some-database-accessing-function))
</pre>
<p>and it will use connection <code>("foo" "bar" "baz" "quux")</code> instead of the default one. I can even encapsulate it as a macro</p>
<pre>
(defmacro with-db (dbid &body body)
  `(let ((*connection* (get-spec ,dbid)))
     (with-connection *connection*
       ,@body)))
</pre>
<p>(<em>dbid</em> and <em>get-spec</em> are just more convenience features, so that I can refer to the connection by a single keyword instead of a list of 4 elements).</p>
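<p>For illustration, here's roughly what such a lookup could look like; <code>get-spec</code> isn't shown in this post, so the database names and specs below are entirely made up:</p>
<pre>
;; A hypothetical spec table; the real get-spec may work differently.
(defparameter *db-specs*
  '(:old ("jmdict_old" "postgres" "password" "localhost")
    :new ("jmdict_new" "postgres" "password" "localhost")))

(defun get-spec (dbid)
  (getf *db-specs* dbid))
</pre>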
<p>So far so good, but there’s a flaw with this approach. For performance reasons, some of the data from the database is stored in certain global variables. For example I have a variable <code>*suffix-cache*</code> that contains a mapping between various word suffixes and objects in the database that represent these suffixes. Obviously if I run something with a different connection, I want to use a <code>*suffix-cache*</code> that’s actually suitable for this connection.</p>
<p>I created a simple wrapper macro around defvar that looks like this:</p>
<pre>
(defvar *conn-vars* nil)

(defmacro def-conn-var (name initial-value &rest args)
  `(progn
     (defvar ,name ,initial-value ,@args)
     (pushnew (cons ',name ,initial-value) *conn-vars* :key 'car)))
</pre>
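<p>With that in place, declaring a connection-dependent variable is a one-liner; using <code>*suffix-cache*</code> from above as the example (the initial value is my guess, not necessarily what Ichiran uses):</p>
<pre>
(def-conn-var *suffix-cache* nil) ; initial value assumed for illustration
</pre>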
<p>Now <code>with-db</code> can potentially add new dynamic variable bindings together with <code>*connection*</code> based on the contents of <code>*conn-vars*</code>. It’s pretty trivial to add the new bindings at macro expansion time. However that poses another problem: now <em>all</em> the conn-vars need to be defined before <code>with-db</code> is expanded. Moreover, if I introduce a new conn-var, all instances of the <code>with-db</code> macro must be recompiled. This might not be a problem for something like a desktop app, but my web app usually runs for months without being restarted, with new code being hot-swapped into the running image. I certainly don’t need the extra hassle of having to recompile everything in a specific order.</p>
<p>Meanwhile I had the definition of <a href="http://www.lispworks.com/documentation/HyperSpec/Body/s_let_l.htm"><code>let</code></a> open in the Hyperspec, and there was a link to <a href="http://www.lispworks.com/documentation/HyperSpec/Body/s_progv.htm"><code>progv</code></a> at the bottom. I had no idea what it did, and thinking that my Lisp had gotten rusty, clicked through to refresh my memory. Imagine my surprise when I found that 1) I had never used this feature before and 2) it was <em>exactly</em> what I needed. Indeed, if I can bind dynamic variables at runtime, then I don’t need to re-expand the macro every time the set of these variables changes.</p>
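<p>In case you've never seen it either: <code>progv</code> takes a <em>list</em> of symbols and a <em>list</em> of values, both computed at runtime, and establishes dynamic bindings for them. A minimal REPL sketch of my own:</p>
<pre>
(defvar *a* :old)

(let ((vars '(*a*))   ; both lists can be built at runtime
      (vals '(:new)))
  (progv vars vals
    (print *a*)))     ; prints :NEW
(print *a*)           ; prints :OLD again, the binding was dynamic
</pre>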
<p>The final code ended up being pretty messy, but it worked:</p>
<pre>
(defvar *conn-var-cache* (make-hash-table :test #'equal))

(defmacro with-db (dbid &body body)
  (alexandria:with-gensyms (pv-pairs var vars val vals iv key exists)
    `(let* ((*connection* (get-spec ,dbid))
            (,pv-pairs (when ,dbid
                         (loop for (,var . ,iv) in *conn-vars*
                               for ,key = (cons ,var *connection*)
                               for (,val ,exists) = (multiple-value-list
                                                     (gethash ,key *conn-var-cache*))
                               collect ,var into ,vars
                               if ,exists collect ,val into ,vals
                               else collect ,iv into ,vals
                               finally (return (cons ,vars ,vals))))))
       (progv (car ,pv-pairs) (cdr ,pv-pairs)
         (unwind-protect
              (with-connection *connection*
                ,@body)
           (loop for ,var in (car ,pv-pairs)
                 for ,key = (cons ,var *connection*)
                 do (setf (gethash ,key *conn-var-cache*) (symbol-value ,var))))))))
</pre>
<p>Basically the <code>loop</code> creates a pair of the list of variables and the list of their values (no idea why <code>progv</code> couldn’t have accepted an alist or something). The values are taken from <code>*conn-var-cache*</code>, which takes the pairing of variable name and connection spec as the key. Then I also add an <code>unwind-protect</code> to save the values of the variables that might have changed within the <em>body</em> back into the cache. Note that this makes nested <code>with-db</code>’s unreliable! The fix is possible, and left as an exercise to the reader. Another problem is that dynamic variable bindings don’t get passed into new threads, so no threads should be spawned within the <code>with-db</code> macro.</p>
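<p>The payoff is that switching databases is now a one-liner (a sketch; <code>:old</code> is an assumed dbid keyword):</p>
<pre>
(with-db :old
  (some-database-accessing-function))
</pre>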
<p>And this is how I ended up using <code>progv</code> in production. This probably dethrones <em>displaced array strings</em> as the most obscure feature in my codebase. Hopefully I’ll have more things to write about in the future. Until next time!</p>https://readevalprint.tumblr.com/post/145348323128https://readevalprint.tumblr.com/post/145348323128Fri, 03 Jun 2016 05:12:51 -0400common lispprogrammingpostgresqlIt's alive! The path from library to web-app.<p><em>In case you’d rather just play with the website instead of reading this boring post, the url is </em><strong><a href="http://ichi.moe/">http://ichi.moe/</a></strong></p>
<p>In my previous posts (<a href="http://readevalprint.tumblr.com/post/96866606738/my-little-transliterator-cant-be-this-clos">part 1</a>, <a href="http://readevalprint.tumblr.com/post/97467849358/who-needs-graph-theory-anyway">part 2</a>, <a href="http://readevalprint.tumblr.com/post/99568614213/words-made-out-of-words">part 3</a>) I described the development process of a romanization algorithm for texts in the Japanese language. However the ultimate goal was always to make a simple one-purpose web application that makes use of this algorithm. It took quite a while, but it’s finally here. In this post I will describe the technical details behind the development of this website.</p>
<p>I decided to build it with bare <a href="http://weitz.de/hunchentoot/">Hunchentoot</a>; while there are some nice Lisp web frameworks developed lately like <a href="http://restas.lisper.ru/en/">Restas</a> or <a href="http://8arrow.org/caveman/">Caveman</a>, my app would be too simple to need them. There would be a single handler that takes a query and various options as GET parameters, and returns a nicely formatted result.</p>
<p>Now I needed something to produce HTML. I used CL-WHO before, but this time I wanted a templating library into which I could just copy-paste plain HTML. I settled on <a href="https://github.com/archimag/cl-closure-template">closure-templates</a>, which is based on <a href="https://developers.google.com/closure/templates/docs/overview">Google Closure Templates</a> but the syntax is slightly different. Now, I don’t know if I should recommend this library because its documentation in English is rather sparse and it has a dead link in its Github description. It has a <a href="http://archimag.lisper.ru/docs/cl-closure-template/index.html">detailed manual</a> written in Russian, so I was able to read that. As to why I chose it, this library has a major advantage over <a href="http://www.cliki.net/html%20template">its competitors</a>. The same template can be compiled into a Common Lisp function and into Javascript! Why is this useful? Well, for example, I have these cute cards that explain the definition of a word:</p>
<p><figure data-orig-height="275" data-orig-width="243" data-orig-src="https://64.media.tumblr.com/515ca28b4a83824ca268736b06b6d376/tumblr_inline_nekan6DABm1qhuq47.png"><img alt="image" src="https://64.media.tumblr.com/515ca28b4a83824ca268736b06b6d376/tumblr_inline_paengmemEK1qhuq47_540.png" data-orig-height="275" data-orig-width="243" data-orig-src="https://64.media.tumblr.com/515ca28b4a83824ca268736b06b6d376/tumblr_inline_nekan6DABm1qhuq47.png"/></figure></p>
<p>These are mostly generated statically by the Common Lisp backend. But if I want to make such a card on the fly client-side, I can call the corresponding Javascript function and it will produce the exact same HTML content. This makes dynamic HTML generation really easy.</p>
<p>For the front-end framework I chose <a href="http://foundation.zurb.com/">Foundation</a>. Bootstrap was my first choice, but it really doesn’t look all that great and it’s difficult to customize. So I decided to try something else. Foundation was pretty nice; it was easy to make my website responsive and look decent on a mobile screen. The only problem, as I later discovered, was its sheer size. My browser refused to cache the 183kb javascript file (minified!) for some reason, so each page load took quite a while. Fortunately that was solved by loading this file from the Cloudflare CDN.</p>
<p>One thing I didn’t concern myself about when writing the backend <a href="https://github.com/tshatrov/ichiran">Ichiran</a> algorithm was thread safety. However, as Hunchentoot uses threads to process requests, this matter becomes very important. Fortunately writing thread-safe code in Lisp is not that hard. Mostly you should just avoid modifying global special variables (binding them with <strong>let</strong> is okay) and be careful with writing persistent data. Since my app is pretty much read-only, there was only one such issue. I am storing a cache of word suffixes in a special variable. Generating this cache takes several seconds, but is only done once per session. As you can guess, this creates problems with thread safety, so I put a lock around this procedure and called it when the server is launched. Each server launch would therefore take several seconds, which is suboptimal. Later I would make the lock non-blocking and display a warning if the init-suffixes procedure is in progress.</p>
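<p>A minimal sketch of that initialization pattern, assuming the <a href="https://common-lisp.net/project/bordeaux-threads/">bordeaux-threads</a> library for locking (the lock variable and the build-suffix-cache helper are my inventions, not the actual ichiran internals):</p>
<pre class="prettyprint lang-cl">(defvar *suffix-lock* (bordeaux-threads:make-lock "suffix-cache"))
(defvar *suffix-cache* nil)

(defun init-suffixes ()
  (bordeaux-threads:with-lock-held (*suffix-lock*)
    (unless *suffix-cache*
      (setf *suffix-cache* (build-suffix-cache))))) ; assumed builder function
</pre>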
<p>Like I said before, I wanted my data to be compatible between Common Lisp and Javascript, so I added some functions to Ichiran to produce JSON objects containing various data. There are many JSON libraries for Common Lisp. I was using <a href="https://github.com/madnificent/jsown">jsown</a> before, so I decided to stick with it. jsown objects are lists with <strong>:obj</strong> as the first element and alist of properties as its cdr. The problem was that closure-templates only supports plists and alists as its context parameter, and jsown object is neither. The solution was to extend the methods <strong>fetch-property</strong> and <strong>fetch-keys</strong>. Since they are already defined for lists, I added :around methods to check for jsown objects specifically and call-next-method on cdr in that case.</p>
<pre class="prettyprint lang-cl">(defmethod closure-template:fetch-property :around ((map list) key)
  "Support for jsown dict objects"
  (if (and (not (integerp key))
           (eql (car map) :obj)
           (every #'listp (cdr map)))
      (call-next-method (cdr map) key)
      (call-next-method)))

(defmethod closure-template:fetch-keys :around ((map list))
  (if (and (eql (car map) :obj)
           (every #'listp (cdr map)))
      (call-next-method (cdr map))
      (call-next-method)))
</pre>
<p>Theoretically this would fail if passed a valid plist like <code>'(:obj (1 2))</code>, but this cannot possibly happen in my application.</p>
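<p>For reference, this is the shape we're dealing with (standard jsown representation; key order may differ):</p>
<pre class="prettyprint lang-cl">(jsown:new-js ("rw" 1218) ("rh" 2048))
;; => (:OBJ ("rw" . 1218) ("rh" . 2048))
</pre>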
<p>Now, at some point I had to actually put my app online. I needed a server and a domain name and I needed them cheap (because I’m currently unemployed (pls hire me)). For the server I chose a Linode VPS, and I bought the <a href="http://ichi.moe/">ichi.moe</a> domain from Name.com. I still think these new TLDs are a pretty stupid idea, but at least they give us all an opportunity to buy a short and memorable domain name. I spent the rest of the day configuring my Linode server, which I had never done before. Thankfully the documentation they provide is really good.</p>
<p>Because I wanted to get the most juice out of my cheap-ass server, the plan was to put the hunchentoot server behind Nginx and to cache everything. There are <a href="https://zaries.wordpress.com/2010/11/09/lisp-web-server-from-scratch-using-hunchentoot-and-nginx/">existing</a> <a href="http://blog.ponto-dot.com/2009/08/18/hunchentoot-behind-proxy-server/">guides</a> on how to do this setup, which were very helpful. In my setup everything is served by Nginx except for URLs that start with /cl/, which are passed to Hunchentoot. The static pages (including error pages) are also generated by closure-template (so that the design is consistent), but they are just dumped into .html files served by Nginx. Nginx also caches dynamic content, which might help if some high-traffic site links to a certain query. This, and the fact that Linodes are hosted on SSDs, made the site run pretty smoothly.</p>
<p>Now let’s talk about my infrastructure. As described in the guides above, I have a special hunchentoot user in addition to the main user. The main user’s quicklisp directory is symlinked to hunchentoot’s so the server can load the code but cannot write there. The code is stored in 2 repositories. One is the open-source core of the project (<a href="https://github.com/tshatrov/ichiran">ichiran</a>) and the other one is a private bitbucket repository ichiran-web which holds web-related code. However a simple git pull doesn’t update the code running on the server. If I’m lazy, I do “sudo service hunchentoot restart”, which restarts everything and reloads the code. This might of course create service interruptions for the users. Another option is hot swapping all the changes. For this purpose my hunchentoot server also starts a swank server like this:</p>
<pre class="prettyprint lang-cl">(defun start-app (&optional (port 8080))
  (handler-case (swank:create-server :dont-close t)
    (error ()))
  (ichiran/dict:init-suffixes)
  (refresh-handlers)
  (let ((acceptor (make-instance 'easy-acceptor :port port
                                 :access-log-destination *access-log*
                                 :message-log-destination *message-log*)))
    (setf *ichiran-web-server* (start acceptor))))
</pre>
<p>Swank is, of course, the server-side component of <a href="https://github.com/slime/slime">SLIME</a>. It runs on a port that is not accessible remotely and can only be connected to locally or via SSH tunnel. I use the latter to connect SLIME on my PC to Swank running on my server, which allows me to apply various fixes without restarting, either from the REPL or by using C-c C-c to recompile some function.</p>
<p>Anyway, I’m pretty happy with the way things turned out, and I got some positive feedback already. The biggest thing left is tightening up the web design, which is my least favorite part of web development. The other thing is attracting enough traffic so that I can analyze the performance (I’m only getting a few people a day right now, which barely makes a blip on my server’s CPU graph).</p>
<p>In retrospect, getting this website up and running was pretty easy. I spent much more time trying to tweak the ichiran library to split sentences correctly (and I’m still working on it). It’s not much harder than, say, building a Django-based site. The tools are all there, the documentation is out there (kind of). VPSes are cheap. And it spreads awareness of Common Lisp. No reason not to try!</p>https://readevalprint.tumblr.com/post/101841449553https://readevalprint.tumblr.com/post/101841449553Wed, 05 Nov 2014 08:36:56 -0500common lisplispprogrammingweb programmingjapaneseWords made out of words<p>Since my <a href="http://readevalprint.tumblr.com/post/97467849358/who-needs-graph-theory-anyway">last post</a> I’ve done a lot of work on my Japanese sentence-segmenting algorithm, so it’s time for an update.</p>
<p>First of all, I added conjugations. <a href="http://www.edrdg.org/jmdictdb/cgi-bin/conj.py?e=1039987&svc=jmdict&sid=&e=">Here’s how JMdict does conjugations</a>. That’s for a single verb. There’s a note saying “this table has been automatically generated”; indeed, in JMdict conjugations are generated on a basis of a rather large .csv file and are not stored in the database. Obviously for my purposes it is more efficient to have these in my database, so I ported a (rather simple) algorithm to Common Lisp and wrote a (really complex) procedure to load them. It takes quite a while to INSERT those one by one, which made me wish <a href="http://marijnhaverbeke.nl/postmodern/">postmodern</a> had some sort of bulk inserting mechanism. Some time later I discovered that some of these conjugations are themselves verbs that can be (and often are) conjugated. So I added “second level” conjugations that point both to first level conjugation and to the original verb. Hopefully “third level” conjugations are rarely used.<br/></p>
<p>Meanwhile I’ve been trying to improve the segmentation algorithm. The first major change was calculating <em>n</em> best segmentations instead of just one. That would allow me to have a better picture of what the algorithm prefers. I came up with a structure that I call top-array, which is basically an array of <em>n</em> scores sorted from biggest to smallest; when a new score is added, we go from the end and push everything smaller than the new score to the right. I thought it was pretty elegant and probably the fastest way to do this for small <em>n</em> (obviously some sort of tree would work better for large <em>n</em>).</p>
<pre class="prettyprint lang-cl">(defstruct (top-array-item (:conc-name tai-)) score payload)

(defclass top-array ()
  ((array :reader top-array)
   (count :reader item-count :initform 0)))

(defmethod initialize-instance :after ((obj top-array) &key (limit 5))
  (setf (slot-value obj 'array) (make-array limit :initial-element nil)))

(defgeneric register-item (collection score payload)
  (:method ((obj top-array) score payload)
    (with-slots (array count) obj
      (let ((item (make-top-array-item :score score :payload payload))
            (len (length array)))
        (loop for idx from (min count len) downto 0
              for prev-item = (when (> idx 0) (aref array (1- idx)))
              for done = (or (not prev-item) (>= (tai-score prev-item) score))
              when (< idx len) do (setf (aref array idx) (if done item prev-item))
              until done)
        (incf count)))))

(defgeneric get-array (collection)
  (:method ((obj top-array))
    (with-slots (array count) obj
      (if (>= count (length array)) array (subseq array 0 count)))))
</pre>
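<p>A quick REPL check of how this behaves (my own example):</p>
<pre class="prettyprint lang-cl">(let ((top (make-instance 'top-array :limit 3)))
  (register-item top 5 :a)
  (register-item top 9 :b)
  (register-item top 7 :c)
  (register-item top 1 :d)
  (map 'list #'tai-payload (get-array top)))
;; => (:B :C :A), the three best scores in descending order
</pre>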
<p>An instance of top-array is created for every segment (found word in a sentence), as well as one for the entire sentence, from which the best path (a sequence of words) is taken in the end. Then the basic algorithm is similar to the one described in my previous post, but gains an extra inner loop.</p>
<pre class="prettyprint lang-cl">(loop for (seg1 . rest) on segments
      for score1 = (get-segment-score seg1)
      do (register-item (segment-top seg1) score1 (list seg1))
         (register-item top score1 (list seg1))
         (loop for seg2 in rest
               for score2 = (get-segment-score seg2)
               when (>= (segment-start seg2) (segment-end seg1)) do
               (loop for tai across (get-array (segment-top seg1))
                     for path = (cons seg2 (tai-payload tai))
                     for score = (+ score2 (tai-score tai))
                     do (register-item (segment-top seg2) score path)
                        (register-item top score path))))
</pre>
<p>Then <code>(get-array top)</code> would return the <em>n</em> best paths.</p>
<p>After this I started thinking about how to make my algorithm more context-sensitive. The way in which every segment is scored is completely independent of the other segments, which might cause the best scored path to be a sequence of words that make no sense when put next to each other! The above algorithm is easy to modify to add some sort of bonus to two subsequent segments, so my first attempt was to encourage words that like to be next to each other in natural language with some extra score (I called that “synergy”). So, for example, there are “no-adjectives”, which are basically nouns, but when followed by the particle “no” they become adjectives. I added a synergy that adds 15 points if such a word is followed by the particle “no”. In the end this way of doing things has proven limited. Words can have wildly different scores and when things go wrong, an extra 15 points might not be enough to make them right. On the other hand, if I increase this bonus too much, this might erroneously break up words that just so happen to have “no” in them.</p>
<p>Later I came up with the concept of compound words, which are “words” that don’t exist in the database, but rather consist of several words that do exist in the database. Right now, it’s mostly a primary word + one or several suffixes, but potentially there could be prefixes too. For the purposes of segmentation a compound word acts like one single word. One example of a common suffix would be “たい” (-tai), which follows a verb (“to X”) conjugated in a certain way, and the resultant meaning is “to want to X”. Most of these suffixes themselves have many conjugations. To check if a word can be understood as a compound word, I need to check if it ends with one of many suffixes, and then check if the part before the suffix has the correct part of speech or conjugation. All possible suffixes and their meanings are put into a hashtable, and then we can check if a word ends with some of them by checking all its endings against the hashtable.</p>
<pre class="prettyprint lang-cl">(defun get-suffixes (word)
  (init-suffixes)
  (loop for start from (1- (length word)) downto 1
        for substr = (subseq word start)
        for val = (gethash substr *suffix-cache*)
        when val
        collect (cons substr val)))
</pre>
<p>The concept of suffixes has fared much better as now I am able to calculate scores of compound words in a more versatile way.</p>
<p>I would still sometimes encounter phrases that are split badly by my algorithm, but that a human would segment easily. For example if the words “AB” and “ABC” both exist in the database, but “AB” happens to score higher (e.g. because it’s a really common word, while ABC is not so much), then “ABC” would never be segmented as one word “ABC”, it would be “AB”+“C”, even if “C” is a completely worthless word, or even not a word at all (a gap). An example of a “worthless” word is a hiragana spelling of a one-syllable word that would normally be spelled with a kanji. I didn’t care about those much, because they had really low scores and thus only appeared when something went awry. However getting rid of these low-scoring words would allow me to place a large penalty on gaps, and thus “ABC” would be able to score higher than “AB”+gap. In the path-finding algorithm above the same score is put into <em>top</em> and <em>segment-top</em> top-arrays. But if we want to penalize gaps, the score put into <em>top</em> should also include a penalty for the gap to the right of the last segment, if it exists. Penalties for gaps to the left of the leftmost segment and in-between segments should be added to both.</p>
<p>Anyway, I’m pretty happy with how this thing is progressing, and I’m going to switch my efforts to building a web-interface. Here’s how it currently works in REPL:</p>
<p>(<a href="http://i.imgur.com/EWjXyiT.png">click here for full-res image</a>)</p>
<p><img alt="image" src="http://i.imgur.com/EWjXyiT.png"/></p>
<p>Kinda messy, isn’t it? The challenge would be to display all this information in a reasonable manner. I already have some ideas, but it would still probably take some effort to decipher. But then again, translating the sentence was never the goal, just romanizing it, which <a href="https://github.com/tshatrov/ichiran/">ichiran</a> does pretty well right now.</p>https://readevalprint.tumblr.com/post/99568614213https://readevalprint.tumblr.com/post/99568614213Thu, 09 Oct 2014 11:07:00 -0400common lisplispprogrammingjapaneseichiranWho needs graph theory anyway?<p>In <a href="http://readevalprint.tumblr.com/post/96866606738/my-little-transliterator-cant-be-this-clos">my last post</a> I discussed how to make a Japanese->English transliterator and outlined some problems that limited its usefulness. One problem is that there’s no obvious way to segment a sentence into words. I looked up existing solutions, and a lightweight Javascript implementation <a href="http://chasen.org/~taku/software/TinySegmenter/">caught my eye</a>. I quickly <a href="https://github.com/tshatrov/ichiran/blob/master/tokenize.lisp">ported it to Common Lisp</a> and, to the surprise of absolutely no one, the results were awful.</p>
<p>It was clear that I needed an actual database of Japanese words to do segmentation properly. This would also solve the “kanji problem” since this database would also include how to pronounce the words. My first hunch was Wiktionary, but its dump format turned out to be pretty inefficient for parsing.</p>
<p>Fortunately I quickly discovered a free <a href="http://www.edrdg.org/jmdict/j_jmdict.html">JMdict</a> database, which was exactly what I needed. It even had open-source code in Python for parsing and loading its XML dumps. Naturally, I wrote my own code to parse it since its database schema looked too complex for my needs. But I’m not going to discuss that in this post, as it is quite boring.</p>
<p>Now that I had a comprehensive Postgres database of every word in the Japanese language (not really, as it doesn’t include conjugations), it was only a matter of identifying the words in the sentence. To do this, for every substring of a sentence we look up the database for exact matches. There are n(n+1)/2 substrings in a string, so we aren’t doing too badly in terms of performance (and the string wouldn’t be too long anyway since prior to running this procedure I’ll be splitting it by punctuation etc.)</p>
<pre class="prettyprint lang-cl">(defstruct segment
  start end word)

(defun find-substring-words (str)
  (loop for start from 0 below (length str)
        nconcing
        (loop for end from (1+ start) upto (length str)
              for substr = (subseq str start end)
              nconcing (mapcar
                        (lambda (word)
                          (make-segment :start start :end end :word word))
                        (find-word substr)))))
</pre>
<p>The problem is that there are a lot of words, and many of them are spelled identically. I decided to assign a score to each word based on its length (longer is better), whether it’s a preferred spelling of the word, how common the word is, and whether it’s a particle (particles tend to be short and thus need a boost to increase their prominence).</p>
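<p>Something along these lines, though the actual weights and predicates below are invented for illustration:</p>
<pre class="prettyprint lang-cl">(defun word-score (word length)
  (+ (* 10 length)                         ; longer words score higher
     (if (preferred-spelling-p word) 5 0)  ; assumed predicate
     (if (common-word-p word) 10 0)        ; assumed predicate
     (if (particle-p word) 15 0)))         ; boost short but important particles
</pre>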
<p>Now we have the following problem: for a sentence, find the set of non-intersecting segments with the maximum total score. Now, you might have better mathematical intuition than I, but my first thought was:</p>
<blockquote>
<p>This looks <a href="http://en.wikipedia.org/wiki/NP-hard">NP-hard</a>, man. This problem has “<a href="http://en.wikipedia.org/wiki/Travelling_salesman_problem">travelling salesman</a>” written all over it.</p>
</blockquote>
<p>My first attempt to crack it was to calculate score per letter for each word and select words with the highest scores. But a counterexample comes to mind rather easily: in a sentence “ABC” with words “AB” (score=5), “BC” (score=5) and “ABC” (score=6), words “AB” and “BC” have a higher score per letter (2.5), but the optimal covering is provided by the word “ABC” with its score per letter a measly 2.</p>
<p>At this point I was working with the most convenient mathematical instrument, which is pen and paper. The breakthrough came when I started to consider a certain relation between two segments: the segment <em>a</em> can be followed by the segment <em>b</em> <a href="http://en.wikipedia.org/wiki/If_and_only_if">iff</a> (segment-start <em>b</em>) is greater than or equal to (segment-end <em>a</em>). Under this relation our segments form a <a href="http://en.wikipedia.org/wiki/Transitive_relation">transitive</a> <a href="http://en.wikipedia.org/wiki/Directed_acyclic_graph">directed acyclic graph</a>. The proof is left as an exercise for the reader. Clearly we just need to do a <a href="http://en.wikipedia.org/wiki/Transitive_reduction">transitive reduction</a> and use something similar to <a href="http://en.wikipedia.org/wiki/Dijkstra's_algorithm">Dijkstra’s algorithm</a> to find the path with the maximal score! This problem is clearly solvable in polynomial time!</p>
<p>Pictured: actual notes drawn by me</p>
<p><figure class="tmblr-full" data-orig-height="378" data-orig-width="500" data-orig-src="https://64.media.tumblr.com/e33241791f0e56c053f1c6b2486aff6c/tumblr_inline_nbw26pJZwq1qhuq47.jpg"><img alt="image" src="https://64.media.tumblr.com/e1f7ebf808f22625ff06a7d8ebd67381/tumblr_inline_pk8bqxQYMS1qhuq47_540.jpg" data-orig-height="378" data-orig-width="500" data-orig-src="https://64.media.tumblr.com/e33241791f0e56c053f1c6b2486aff6c/tumblr_inline_nbw26pJZwq1qhuq47.jpg"/></figure></p>
<p>In reality the algorithm turns out to be quite simple. Since find-substring-words always returns segments sorted by their start and then by their end, every segment can only be followed by the segments after it. We can then accumulate the largest total score and the path used for it for every segment by using a nested loop:</p>
<pre class="prettyprint lang-cl">(defstruct segment
  start end word (score nil) (accum 0) (path nil))

(defun find-best-path (segments)
  ;; assume segments are sorted by (start, end) (as is the result of find-substring-words)
  (let ((best-accum 0)
        (best-path nil))
    (loop for (seg1 . rest) on segments
          when (> (segment-score seg1) (segment-accum seg1))
          do (setf (segment-accum seg1) (segment-score seg1)
                   (segment-path seg1) (list seg1))
             (when (> (segment-accum seg1) best-accum)
               (setf best-accum (segment-accum seg1)
                     best-path (segment-path seg1)))
          when (> (segment-score seg1) 0)
          do (loop for seg2 in rest
                   if (>= (segment-start seg2) (segment-end seg1))
                   do (let ((accum (+ (segment-accum seg1) (segment-score seg2))))
                        (when (> accum (segment-accum seg2))
                          (setf (segment-accum seg2) accum
                                (segment-path seg2) (cons seg2 (segment-path seg1)))
                          (when (> accum best-accum)
                            (setf best-accum accum
                                  best-path (segment-path seg2)))))))
    (values (nreverse best-path) best-accum)))
</pre>
<p>Of course when I actually tried to run this algorithm, SBCL just crashed. How could that be? It took me a while to figure out, but notice how segment-path contains a list that includes the segment itself. A recursive self-referential structure! When SBCL tried to print that in the REPL, it didn’t result in <a href="http://catb.org/jargon/html/N/nasal-demons.html">dragons flying out of my nose</a> but a crash still happened. Interestingly, Common Lisp has a solution to this: if <a href="http://clhs.lisp.se/Body/v_pr_cir.htm">*print-circle*</a> is set to t, it will actually print the structure using referential tokens. Anyway, I just added the following before returning the result to remove self-references:</p>
<pre class="prettyprint lang-cl">(dolist (segment segments)
  (setf (segment-path segment) nil))
</pre>
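<p>Here's a quick illustration of <code>*print-circle*</code> at the REPL (my own example):</p>
<pre class="prettyprint lang-cl">(let ((list (list 1 2 3)))
  (setf (cddr list) list)      ; make the list circular
  (let ((*print-circle* t))
    (prin1-to-string list)))   ; => "#1=(1 2 . #1#)" instead of crashing
</pre>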
<p>So, did it work? Yes, it did, and the result was impressive! Even though my scoring system is pretty barebones, it’s on par with or even better than Google Translate’s romanization on a few test sentences I tried. I still need to add conjugations, and it can’t do personal names at all, but considering how little code there is and the fact that it doesn’t even attempt grammatical analysis of the sentence (due to me not knowing the language) I am very happy with the result. Also I plan to add a web interface to it so that it’s possible to hover over words and see the translation. That would be pretty useful. The work-in-progress code is on my <a href="https://github.com/tshatrov/ichiran/">Github</a>.</p>https://readevalprint.tumblr.com/post/97467849358https://readevalprint.tumblr.com/post/97467849358Sun, 14 Sep 2014 08:41:25 -0400common lisplispprogramminggraph theorymathjapaneseMy little transliterator can't be this CLOS<p>If you are reading this blog, you are probably able to read Latin script. It is pretty widespread in the world, and used by 70% of the world’s population <a href="http://en.wikipedia.org/wiki/Latin_script">according to Wikipedia</a>. Perhaps, like me, your native language uses a <a href="http://en.wikipedia.org/wiki/Cyrillic_script">different script</a>. There are many <a href="http://en.wikipedia.org/wiki/Writing_system">writing systems</a> in the world, some are related, and some are wildly different from each other. <span class="st">Fortunately with the advent of the Internet and tools like Google Translate it is increasingly possible to read text not only in a language you don’t understand, but even in languages whose writing system you don’t understand.</span></p>
<p><span class="st">Well, Google is Google, but is it possible for a mere mortal to create something like that? Not to translate, but just to present some unknown writing system in your preferred alphabet (the process is called transliteration or transcription)? There’s no reason why not.</span></p>
<p><span class="st">In this post I’ll talk about the process of romanization of Japanese language, which is transcription from Japanese to Latin script. <span class="st">For example “ありがとうございます</span>” is romanized to “arigatō gozaimasu” under Hepburn romanization method (there are many of those).<br/></span></p>
<p><span class="st">First off, the basics of Japanese writing are as follows:</span></p>
<ol>
<li><span class="st">There are several scripts used to write in Japanese language.</span></li>
<li><span class="st"><em>Hiragana</em> is a syllabary (a writing system where each character represents a syllable) that is used for words of Japanese origin.</span></li>
<li><span class="st"><em>Katakana</em> is another syllabary that is used for loan words. Every possible syllable in Japanese language has a hiragana and katakana form, which usually are completely different. Both scripts have about 50 characters in them.</span></li>
<li><span class="st">Chinese characters (<em>kanji</em>) are used for words of Japanese and Chinese origin. There are thousands of such characters. Furthermore, most of them could be read in several different ways, which makes transcribing them difficult. We’re going to ignore those for now.</span></li>
</ol>
<p>If we focus on the romanization of hiragana and katakana (both systems are called <em>kana</em> for short) then the process seems pretty simple. It’s just a matter of replacing each kana with the syllable it represents, written in roman letters. However there are some characters that do not represent a syllable, but rather modify a syllable before or after that character. This includes <em>sokuon</em>, which doubles the consonant of the next syllable, and <em>yoon</em> characters, which are a small version of normal kana and are used to modify a vowel of a preceding syllable.</p>
<p>Ok, so the first thing we must do is to somehow bring order to this madness. Since there is hiragana and katakana version of each character, it doesn’t make sense to work with the characters directly. Instead I’m going to replace each character with a keyword.</p>
<pre class="prettyprint lang-cl">(defparameter *sokuon-characters* '(:sokuon "っッ"))

(defparameter *iteration-characters* '(:iter "ゝヽ" :iter-v "ゞヾ"))

(defparameter *modifier-characters* '(:+a "ぁァ" :+i "ぃィ" :+u "ぅゥ" :+e "ぇェ" :+o "ぉォ"
                                      :+ya "ゃャ" :+yu "ゅュ" :+yo "ょョ"
                                      :long-vowel "ー"))

(defparameter *kana-characters*
  '(:a "あア" :i "いイ" :u "うウ" :e "えエ" :o "おオ"
    :ka "かカ" :ki "きキ" :ku "くク" :ke "けケ" :ko "こコ"
    :sa "さサ" :shi "しシ" :su "すス" :se "せセ" :so "そソ"
    :ta "たタ" :chi "ちチ" :tsu "つツ" :te "てテ" :to "とト"
    :na "なナ" :ni "にニ" :nu "ぬヌ" :ne "ねネ" :no "のノ"
    :ha "は" :hha "ハ" :hi "ひヒ" :fu "ふフ" :he "へヘ" :ho "ほホ"
    :ma "まマ" :mi "みミ" :mu "むム" :me "めメ" :mo "もモ"
    :ya "やヤ" :yu "ゆユ" :yo "よヨ"
    :ra "らラ" :ri "りリ" :ru "るル" :re "れレ" :ro "ろロ"
    :wa "わワ" :wi "ゐヰ" :we "ゑヱ" :wo "を" :wwo "ヲ"
    :n "んン"
    :ga "がガ" :gi "ぎギ" :gu "ぐグ" :ge "げゲ" :go "ごゴ"
    :za "ざザ" :ji "じジ" :zu "ずズ" :ze "ぜゼ" :zo "ぞゾ"
    :da "だダ" :dji "ぢヂ" :dzu "づヅ" :de "でデ" :do "どド"
    :ba "ばバ" :bi "びビ" :bu "ぶブ" :be "べベ" :bo "ぼボ"
    :pa "ぱパ" :pi "ぴピ" :pu "ぷプ" :pe "ぺペ" :po "ぽポ"))

(defparameter *all-characters* (append *sokuon-characters*
                                       *iteration-characters*
                                       *modifier-characters*
                                       *kana-characters*))

(defparameter *char-class-hash*
  (let ((hash (make-hash-table)))
    (loop for (class chars) on *all-characters* by #'cddr
          do (loop for char across chars
                   do (setf (gethash char hash) class)))
    hash))

(defun get-character-classes (word)
  (map 'list (lambda (char) (gethash char *char-class-hash* char)) word))
</pre>
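<p>A quick sanity check of <code>get-character-classes</code> (my own REPL sketch, assuming the tables above are loaded):</p>
<pre class="prettyprint lang-cl">(get-character-classes "ありがとう")
;; => (:A :RI :GA :TO :U)
</pre>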
<p>So now we have a hash table that maps every kana to a keyword that describes it, and, as the sketch above shows, we can trivially convert a word into a list of “character classes” (or the characters themselves for non-kana characters). Then we need to transform this list into a kind of <a href="http://en.wikipedia.org/wiki/Abstract_syntax_tree">AST</a> where modifier characters have the role of functions.</p>
<pre class="prettyprint lang-cl">(defun process-modifiers (cc-list)
  (loop with result
        for (cc . rest) on cc-list
        if (eql cc :sokuon)
        do (push (cons cc (process-modifiers rest)) result) (loop-finish)
        else if (member cc *modifier-characters*)
        do (push (list cc (pop result)) result)
        else do (push cc result)
        finally (return (nreverse result))))
</pre>
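<p>For instance, the precedence example discussed below comes out like this (a REPL sketch):</p>
<pre class="prettyprint lang-cl">(process-modifiers '(:sokuon :ka :+yu))
;; => ((:SOKUON (:+YU :KA)))
</pre>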
<p>This is your basic push/nreverse idiom with some extra recursiveness added. Sokuon is applied to everything to the right of it, because I wanted it to have lower precedence, i.e. (:sokuon :ka :+yu) is parsed as (:sokuon (:+yu :ka)) instead of the other way around. Now we can write the outline of our algorithm:</p>
<pre class="prettyprint lang-cl">(defun romanize-core (method cc-tree)
  (with-output-to-string (out)
    (dolist (item cc-tree)
      (cond ((null item))
            ((characterp item) (princ item out))
            ((atom item) (princ (r-base method item) out))
            ((listp item) (princ (r-apply (car item) method (cdr item)) out))))))
</pre>
<p>The functions r-base and r-apply are generic functions that will depend on the method of romanization. Another generic function will be r-simplify that will “pretty up” the result. It is easy to write some reasonable fallback methods for them:</p>
<pre class="prettyprint lang-cl">(defgeneric r-base (method item)
  (:documentation "Process atomic char class")
  (:method (method item)
    (string-downcase item)))

(defgeneric r-apply (modifier method cc-tree)
  (:documentation "Apply modifier to something")
  (:method ((modifier (eql :sokuon)) method cc-tree)
    (let ((inner (romanize-core method cc-tree)))
      (if (zerop (length inner)) inner
          (format nil "~a~a" (char inner 0) inner))))
  (:method ((modifier (eql :long-vowel)) method cc-tree)
    (romanize-core method cc-tree))
  (:method ((modifier symbol) method cc-tree)
    (format nil "~a~a" (romanize-core method cc-tree) (string-downcase modifier))))

(defgeneric r-simplify (method str)
  (:documentation "Simplify the result of transliteration")
  (:method (method str) str))
</pre>
<p>Of course relying on symbol names isn’t flexible at all. It’s better to have a mapping from each keyword to a string that represents it. This is where we have to resort to classes to store this mapping in a slot.</p>
<pre class="prettyprint lang-cl">(defclass generic-romanization ()
  ((kana-table :reader kana-table
               :initform (make-hash-table))))

(defmethod r-base ((method generic-romanization) item)
  (or (gethash item (kana-table method)) (call-next-method)))

(defmethod r-apply ((modifier symbol) (method generic-romanization) cc-tree)
  (let ((yoon (gethash modifier (kana-table method))))
    (if yoon
        (let ((inner (romanize-core method cc-tree)))
          (format nil "~a~a" (subseq inner 0 (max 0 (1- (length inner)))) yoon))
        (call-next-method))))

(defmacro hash-from-list (var list)
  (alexandria:with-gensyms (hash key val)
    `(defparameter ,var
       (let ((,hash (make-hash-table)))
         (loop for (,key ,val) on ,list by #'cddr
               do (setf (gethash ,key ,hash) ,val))
         ,hash))))

(hash-from-list *hepburn-kana-table*
                '(:a "a" :i "i" :u "u" :e "e" :o "o"
                  :ka "ka" :ki "ki" :ku "ku" :ke "ke" :ko "ko"
                  :sa "sa" :shi "shi" :su "su" :se "se" :so "so"
                  :ta "ta" :chi "chi" :tsu "tsu" :te "te" :to "to"
                  :na "na" :ni "ni" :nu "nu" :ne "ne" :no "no"
                  :ha "ha" :hha "ha" :hi "hi" :fu "fu" :he "he" :ho "ho"
                  :ma "ma" :mi "mi" :mu "mu" :me "me" :mo "mo"
                  :ya "ya" :yu "yu" :yo "yo"
                  :ra "ra" :ri "ri" :ru "ru" :re "re" :ro "ro"
                  :wa "wa" :wi "wi" :we "we" :wo "wo" :wwo "wo"
                  :n "n"
                  :ga "ga" :gi "gi" :gu "gu" :ge "ge" :go "go"
                  :za "za" :ji "ji" :zu "zu" :ze "ze" :zo "zo"
                  :da "da" :dji "ji" :dzu "zu" :de "de" :do "do"
                  :ba "ba" :bi "bi" :bu "bu" :be "be" :bo "bo"
                  :pa "pa" :pi "pi" :pu "pu" :pe "pe" :po "po"
                  :+a "a" :+i "i" :+u "u" :+e "e" :+o "o"
                  :+ya "ya" :+yu "yu" :+yo "yo"))

(defclass generic-hepburn (generic-romanization)
  ((kana-table :initform (alexandria:copy-hash-table *hepburn-kana-table*))))
</pre>
<p>I’m going for a rather versatile class hierarchy here, starting with a completely empty kana-table for the generic-romanization method, but defining the methods on it that will work for any table. Then I define a class generic-hepburn that will be the basis for different Hepburn variations. The table is taken from the Wikipedia article on <a href="http://en.wikipedia.org/wiki/Hepburn_romanization">Hepburn romanization</a>, which is pretty detailed. By carefully reading it, we can identify the exceptions that the above functions can’t handle. For example a :sokuon before :chi is romanized as “tchi” and not as “cchi” as it would be by the simple consonant-doubling method. Another exception is that, for example, :chi followed by :+ya is romanized as “cha”, not “chya”. CLOS makes it easy to handle these irregularities before passing the torch to a less specific method.</p>
<pre class="prettyprint lang-cl">(defmethod r-apply ((modifier (eql :sokuon)) (method generic-hepburn) cc-tree)
  (if (eql (car cc-tree) :chi)
      (concatenate 'string "t" (romanize-core method cc-tree))
      (call-next-method)))

(defmethod r-apply ((modifier (eql :+ya)) (method generic-hepburn) cc-tree)
  (case (car cc-tree)
    (:shi "sha")
    (:chi "cha")
    ((:ji :dji) "ja")
    (t (call-next-method))))

;; ... and the same for :+yu and :+yo
</pre>
<p>Another thing Hepburn romanizations do is simplifying double vowels like “oo”, “ou” and “uu”. For example, our generic-hepburn will romanize “とうきょう” as “toukyou”, while most people are more familiar with the spelling “Tokyo” or “Tōkyō”.</p>
<pre class="prettyprint lang-cl">(defun simplify-ngrams (str map)
  (let* ((alist (loop for (from to) on map by #'cddr collect (cons from to)))
         (scanner (ppcre:create-scanner (cons :alternation (mapcar #'car alist)))))
    (ppcre:regex-replace-all scanner str
                             (lambda (match &rest rest)
                               (declare (ignore rest))
                               (cdr (assoc match alist :test #'equal)))
                             :simple-calls t)))

(defclass simplified-hepburn (generic-hepburn)
  ((simplifications :initform nil :initarg :simplifications :reader simplifications
                    :documentation "List of simplifications e.g. (\"ou\" \"o\" \"uu\" \"u\")")))

(defmethod r-simplify ((method simplified-hepburn) str)
  (simplify-ngrams (call-next-method) (simplifications method)))

(defclass traditional-hepburn (simplified-hepburn)
  ((simplifications :initform '("oo" "ō" "ou" "ō" "uu" "ū"))))
</pre>
<p>I’m using the “<a href="http://weitz.de/cl-ppcre/#create-scanner2">parse tree</a>” feature of CL-PPCRE here to create a complex :alternation regex on the fly and then use regex-replace-all with a custom replacing function. It’s <em>probably</em> not the most efficient method, but sometimes outsourcing string manipulations to a well-tested regex engine is the least painful solution. Anyway, we’re really close now, and all that’s left is to chain up our functions for a useful API.</p>
<pre class="prettyprint lang-cl">(defparameter *hepburn-traditional* (make-instance 'traditional-hepburn))

(defvar *default-romanization-method* *hepburn-traditional*)

(defun romanize-list (cc-list &key (method *default-romanization-method*))
  "Romanize a character class list according to method"
  (let ((cc-tree (process-modifiers cc-list)))
    (values (r-simplify method (romanize-core method cc-tree)))))

(defun romanize-word (word &key (method *default-romanization-method*))
  "Romanize a word according to method"
  (romanize-list (get-character-classes word) :method method))

>>> (romanize-word "ありがとうございます")
"arigatōgozaimasu"
</pre>
<p>At my <a href="https://github.com/tshatrov/ichiran/">Github</a> you can find an unabridged version of the above code. However there are still some difficult problems with romanization of Japanese that can’t be solved as easily. Even leaving kanji aside, the hiragana character は is pronounced either as “ha” or “wa” depending on whether it is used as a particle. For example a common greeting “こんにちは” is romanized as “konnichiwa” and not “konnichiha” because は plays the role of a particle. Which brings us to another problem: there are no spaces between the words, so it’s not possible to determine whether は is part of a word or a standalone particle without a dictionary, and even then it can be ambiguous! I’m ending the post on this note, since I’m still not sure how to solve this. さようなら!</p>https://readevalprint.tumblr.com/post/96866606738https://readevalprint.tumblr.com/post/96866606738Sun, 07 Sep 2014 06:00:16 -0400common lisplispprogrammingjapaneselanguageLiving on the edge<p>Lately my primary Lisp has been an <a href="https://github.com/akovalenko/sbcl-win32-threads">SBCL fork for Windows</a> which is based on SBCL 1.1.4 and is now pretty old. The official release of SBCL for Windows is 1.2.1 so I decided to try it out. The installer managed to delete my old version of SBCL, so there was no way back now. I tried to run it, but it still tried to use the .core from the old SBCL. Strange, I’m pretty sure the system environment variables have been updated. Yep, I go to system settings and SBCL_HOME points at the correct directory. I run “cmd” and nope, SBCL_HOME points at the old directory. How could that be? After some mucking about, I save the environment variables again and now it has updated. SBCL now runs from the command line. Success?</p>
<p>Ok, so I run SLIME and it tries to use some symbol from an SBCL system package which has clearly been removed at some point. My SLIME isn’t even that old, last updated in 2013. I actually installed it via Quicklisp, wonder if this will work? I run SBCL from the command line and do (ql:update-all-dists). Lots of libraries get updated, including SLIME 2014-08-01. Oh, this is good stuff.</p>
<p>I start up Emacs, load SLIME and face a certain bug I had already faced on another computer. At some point SLIME became, let’s say, not very compatible with the Emacs 24.1 and 24.2 series, because the Emacs developers did something with the 'cl package and SLIME relies on that change. Guess I’ll have to update Emacs too.</p>
<p>As a result I have been forced to update to a shiny new Lisp stack from 2014. To compare, at work we have to use Python 2.6 (released in 2008) and Django 1.3 (released in 2011 and already deprecated). It’s actually amazing how many libraries still run on Python 2.6. Meanwhile Common Lisp as a language hasn’t changed since like the 80s and yet you must always watch out for compatibility issues! Keep up with the times!</p>https://readevalprint.tumblr.com/post/96073154058https://readevalprint.tumblr.com/post/96073154058Fri, 29 Aug 2014 04:50:33 -0400common lispprogrammingpythonsbclemacsslimequicklispWeb scraping with Common Lisp: cookies and stuff<p>It’s been a long time since my last post, but let’s pick up where I left off.</p>
<p><a href="http://readevalprint.tumblr.com/post/80764585017/web-scraping-with-common-lisp-introduction">Read part 1 here!</a></p>
<p>Often you want to grab some data from a website, but you can’t just send a GET request to a page; you need to log in first. How does the site even know you’re logged in? Well, attached to each request is a bunch of <em>cookies</em>, which are essentially name-value pairs. Moreover, the server’s response may update or add new cookies, which are faithfully stored by your browser for some period of time. So when you are logging in to a site the following happens:</p>
<ol>
<li>You send a POST request with your name and password as parameters.</li>
<li>The server responds by setting up your cookies in a way that allows it to recognize your future requests. Usually it sets a cookie that contains your “session id”, which uniquely identifies your current browser session.</li>
<li>When you make any requests after that, the cookie that contains session id is sent along with them, and the server assumes you are logged in.</li>
</ol>
<p>As you can see, the whole algorithm hinges on the fact that your browser must store and resend the cookie that the server has set up. And when you are making requests through a library or a command-line tool such as <em>curl</em> or <em>wget</em>, the responsibility to store and pass the cookies lies upon you.</p>
<p>Ok, so with Common Lisp we’re using the DRAKMA library. By default it will not send any cookies, or do anything with received cookies. However if you pass a special <em>cookie jar</em> object as a keyword parameter to http-request, it will send cookies from it, and update them based on the server’s response. If you use the same cookie jar object to POST a login request and later to retrieve some data, <em>usually</em> this will be enough to trick the server into serving you the content behind the authentication wall.</p>
<pre class="prettyprint lang-cl">(let ((cookie-jar (make-instance 'drakma:cookie-jar)))
  (drakma:http-request login-url :method :post :parameters login-parameters :cookie-jar cookie-jar)
  (drakma:http-request data-url :cookie-jar cookie-jar))
</pre>
<p>I think it’s annoying to always write “:cookie-jar cookie-jar” for every request, so in my library <a href="https://github.com/tshatrov/webgunk/blob/master/webgunk.lisp">webgunk</a> a special variable *webgunk-cookie-jar* is passed as the requests’ cookie jar (it’s nil by default). So you can instead:</p>
<pre class="prettyprint lang-cl">(let ((*webgunk-cookie-jar* (make-instance 'drakma:cookie-jar)))
  (http-request login-url :method :post :parameters login-parameters)
  (http-request data-url))
</pre>
<p>Special variables are sure handy. In <a href="https://github.com/tshatrov/webgunk/blob/master/modules.lisp">webgunk/modules</a> I created an object-oriented API that uses this feature and <a href="https://github.com/tshatrov/webgunk/blob/master/reddit.lisp">webgunk/reddit</a> is a simple reddit browser based on it. Here’s the code for authorization:</p>
<pre class="prettyprint lang-cl">(defmethod authorize ((obj reddit-browser) &key login password)
  (let ((login-url "https://ssl.reddit.com/api/login"))
    (with-cookie-jar obj
      (http-request login-url :method :post
                    :parameters `(("api_type" . "json")
                                  ("user" . ,login)
                                  ("passwd" . ,password))))))
</pre>
<p>where with-cookie-jar is just</p>
<pre class="prettyprint lang-cl">(defmacro with-cookie-jar (obj &body body)
  `(let ((*webgunk-cookie-jar* (get-cookie-jar ,obj)))
     ,@body))
</pre>
<p>Note that logging in isn’t always this easy. Sometimes the server’s procedure for setting the cookies is rather tricky (e.g. involving Javascript and redirects). However you can <em>almost always</em> trick the server into thinking you’re logged in by logging in <em>with your browser</em> and then copying the cookie values from your browser (this is known as <a href="http://en.wikipedia.org/wiki/Session_hijacking">session hijacking</a>, except you’re only hijacking your own session so it’s ok).</p>
<p>For example, I used to play an online game called <a href="http://www.saltybet.com/">SaltyBet</a>, in which you place imaginary money bets on which character will win in a fight. The outcome could be predicted by analyzing the past fights of each character. After losing a million SaltyBucks due to suboptimal betting, I built a system that collected the results of past fights from SaltyBet’s own website, calculated and displayed the stats for each character, and also showed their most recent fights and the biggest upsets they had been involved in. This was incredibly effective and I was able to recoup my lost money twice over!</p>
<p>But anyway, the data was available only to paid members so I needed to log in to scrape it. And the method described above did not work. In the end what worked was a simple:</p>
<pre class="prettyprint lang-cl">(defparameter *cookies*
  '(("PHPSESSID" . "4s5u76vufh0gt9hs6mrmjpioi0")
    ("__cfduid" . "dc683e3c2eb82b6c050c1446d5aa203dd1376731139271")))

(defmethod authorize ((obj saltybet-browser) &key)
  (clear-cookies obj)
  (loop for (name . value) in *cookies*
        do (add-cookie obj (make-instance 'drakma:cookie :name name :value value
                                          :domain ".saltybet.com"))))
</pre>
<p>How did I get these values? I just copy-pasted them from my Firefox! They were good for a few days, so it wasn’t much hassle at all. Sometimes a stupid solution is the most effective one.</p>https://readevalprint.tumblr.com/post/95817212328https://readevalprint.tumblr.com/post/95817212328Tue, 26 Aug 2014 07:45:00 -0400common lisplispprogrammingredditsaltybetWeb scraping with Common Lisp: introduction<p>Web scraping is the process of collecting data from websites. There is a lot of data publically accessible on the Internet and sometimes you want to do something with that data programmatically. Instead of manually copying and pasting stuff from websites to some sort of spreadsheet, you might as well write a script that does it for you. The process depends on the way the data is embedded into a target website. In the usual case you have to perform the following:</p>
<ol>
<li>Use a <em>web client</em> to download the contents of a webpage.</li>
<li>Convert these contents (a string) into some sort of internal representation in your programming language (usually a tree of HTML elements).</li>
<li>Find and extract the data you need.</li>
</ol>
<p>Common Lisp has the necessary libraries for each of those steps, but no single library that puts everything together, which is why I threw together <a href="https://github.com/tshatrov/webgunk">Webgunk</a>, a mish-mash of various libraries and helper functions to make web scraping easier. It doesn’t have a stable API or anything and I’m still adding new things to it.</p>
<h3>Web client</h3>
<p>Edi Weitz’s <a href="http://weitz.de/drakma/">DRAKMA</a> is the most popular HTTP client for Common Lisp, and it works fine in practice. One small problem I had with it is that its http-request function sometimes returns a string, and sometimes an octet array. Either way, I like strings much more, so I wrote a wrapper around it using <a href="http://weitz.de/flexi-streams/">FLEXI-STREAMS</a>:</p>
<pre class="prettyprint lang-cl">(defun http-request (uri &rest args)
  "A wrapper around DRAKMA:HTTP-REQUEST which converts octet array
which it sometimes returns to normal string"
  (let* ((result-mv (multiple-value-list
                     (apply #'drakma:http-request uri
                            `(,@args :cookie-jar ,*webgunk-cookie-jar*))))
         (result (car result-mv)))
    (apply #'values
           (if (and (arrayp result)
                    (equal (array-element-type result) '(unsigned-byte 8)))
               (flexi-streams:octets-to-string result)
               result)
           (cdr result-mv))))
</pre>
<p>You might notice it preserves all the secondary return values as well.</p>
<h3>Parsing HTML</h3>
<p>I’m using the <a href="http://common-lisp.net/project/closure/closure-html/examples.html">Closure HTML</a> parser (not to be confused with 1000 other things called Clo[s/j/z]ure) to convert the resulting string into a Lispy representation of the DOM tree (<a href="http://common-lisp.net/project/cxml/dom.html">CXML-DOM</a>).</p>
<pre class="prettyprint lang-cl">(defun parse-url (url &rest args)
  "Parse HTTP request response to CXML-DOM"
  (let ((response (apply #'http-request url args)))
    (chtml:parse response (cxml-dom:make-dom-builder))))
</pre>
<h3>Finding your data</h3>
<p>It is possible to use the standardized DOM API to find the required elements in the resulting tree (and it’s worth knowing it), but really, most of the time you want to just use a CSS selector to grab the elements you need. This is where the <a href="https://github.com/AccelerationNet/css-selectors">CSS-SELECTORS</a> library comes in handy.</p>
<pre class="prettyprint lang-cl">(let ((document (parse-url "http://www.google.com/search?q=something")))
  (css:query "h3.r a" document))
</pre>
<p>returns a list of links from a Google search.</p>
<h3>Extracting your data</h3>
<p>Getting the text value of an HTML element isn’t as easy as you might think. Because it can have other elements inside of it, you must recursively walk and join all the text nodes. There are also a bunch of rules regarding whitespace, which must be stripped correctly from the resulting string.</p>
<p>This is what the function node-text in Webgunk does:</p>
<pre class="prettyprint lang-cl">(defun node-text (node &rest args &key test (strip-whitespace t))
  (let (values result)
    (when (or (not test) (funcall test node))
      (dom:do-node-list (node (dom:child-nodes node))
        (let ((val (case (dom:node-type node)
                     (:element (apply #'node-text node args))
                     (:text (dom:node-value node)))))
          (push val values))))
    (setf result (apply #'concatenate 'string (nreverse values)))
    (if strip-whitespace (strip-whitespace result) result)))
</pre>
<p>It calls strip-whitespace which is just a bunch of regex replacements (see full source code <a href="https://github.com/tshatrov/webgunk/blob/master/webgunk.lisp">here</a>).</p>
<p>Another place where the data can hide is HTML attributes. Fortunately, <code>dom:get-attribute</code> pretty much solves this problem. For example, <code>(dom:get-attribute link "href")</code> returns the href attribute of a node.</p>
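<p>Putting all the pieces together, a complete scrape of the hypothetical Google example above might look like this:</p>
<pre class="prettyprint lang-cl">;; Collect (text . url) pairs for each search result link
(let ((document (parse-url "http://www.google.com/search?q=something")))
  (loop for link in (css:query "h3.r a" document)
        collect (cons (node-text link)
                      (dom:get-attribute link "href"))))
</pre>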
<p>That’s it for today. In the next installment I’ll probably discuss authentication and other fun stuff you can do.</p>
https://readevalprint.tumblr.com/post/80764585017 Wed, 26 Mar 2014 05:05:00 -0400
Tags: common lisp, lisp, programming, web scraping, dom, html, drakma

Print a tree in Common Lisp (really cheaply)

<p>Just a cheap way to print a tree structure to your REPL.</p>
<pre class="prettyprint lang-cl">(defun print-tree (tree &optional (offset 0))
(loop for node in tree do
(terpri)
(loop repeat offset do (princ " |"))
(format t "-~a" (car node))
(print-tree (cdr node) (1+ offset))))
>>> (print-tree
'(("branch1" ("node1") ("node2"))
("branch2" ("subbranch" ("node3") ("node4"))
("node5"))))
-branch1
|-node1
|-node2
-branch2
|-subbranch
| |-node3
| |-node4
|-node5
</pre>
https://readevalprint.tumblr.com/post/80698129817 Tue, 25 Mar 2014 15:13:00 -0400
Tags: common lisp, lisp, programming, code

GIMP scripting with Script-Fu: namespaces

<p>GIMP, a popular open-source image editor, can be scripted in several programming languages, usually Scheme or Python. The Scheme interpreter is called Script-Fu and is based on the TinyScheme dialect, which implements a subset of the R5RS standard as well as some extensions (like macros).</p>
<p>GIMP docs have <a href="http://docs.gimp.org/en/gimp-using-script-fu-tutorial.html">a decent tutorial</a> to get started, and the library of GIMP-specific functions is well documented via the Procedure Browser (pdb). However, many things about the language itself are shrouded in mystery: the TinyScheme documentation is rather sparse, and it has hidden features that aren’t obvious. This is the first post in a series exploring these hidden features.</p>
<h3>Namespaces</h3>
<p>When you write a script, it is natural to create a bunch of helper functions. When I was writing <a href="http://registry.gimp.org/node/26501">Animstack</a>, I didn’t realize that functions defined at toplevel are shared between <em>all</em> scripts. This isn’t as nice as it might sound, because an unrelated script can overwrite your function, leading to bugs. When I was writing my second script, <a href="http://registry.gimp.org/node/27601">BgMask</a>, I took this into account and started prefixing all my helper functions (their names mostly start with “bgmask-”). However, this too is a suboptimal solution. What I really wanted was namespaces. And surprisingly, TinyScheme actually does have them!</p>
<pre>(define my-namespace (make-environment
  ;;; define your functions here
  (define (my-function arg)
    ...)
  (define (blah-blah img drw ...)
    ...
    (my-function ...))
  ))

(define script-fu-blah-blah my-namespace::blah-blah)
;;; register your script here
</pre>
<p>Inside <code>make-environment</code> you can define as many functions as you want, and they will not clash with similarly named functions in other scripts (as long as the namespaces themselves are named differently!). Outside of it, use the double-colon syntax to access symbols defined in the environment. Pretty cool, huh? Why doesn’t Javascript have something like this?</p>
https://readevalprint.tumblr.com/post/80686273423 Tue, 25 Mar 2014 12:38:24 -0400
Tags: gimp, script-fu, scheme, programming, lisp

Future topics

<p>This is my programming blog. I am the author of <a href="http://animstack.tumblr.com/">Animstack</a> and other, less interesting stuff. As you can guess from the blog’s title, I like Lisp and interactive programming. For my day job I program in Python. By the way, I am for hire ;)</p>
<ul>
<li>web scraping in Common Lisp</li>
<li>programming Script-Fu scripts for GIMP</li>
<li>other stuff, but mostly Lisp</li>
</ul>
https://readevalprint.tumblr.com/post/80561033086 Mon, 24 Mar 2014 04:29:01 -0400
Tags: common lisp, gimp, script-fu, scheme, programming

Getting file stats for Unicode filenames in SBCL

<p>I have a homebrew directory synchronisation tool called <a href="https://github.com/tshatrov/cl-sync">CL-SYNC</a>, which I use to interactively synchronise my data between my desktop and an external HD that travels with me. I wrote it for CLISP at first, and it calls Windows command-line commands like “copy /b”, so it’s completely unportable to other OSes (well, Linux has better sync tools, I’m sure). Anyway, the main problem was that filenames containing non-Latin, non-Cyrillic characters didn’t work well with CLISP for some reason. At some point I discovered that SBCL surprisingly had great Unicode support for Windows filenames, so I decided to port CL-SYNC to SBCL. However, I encountered a few problems while doing it.</p>
<p>To track file modification dates, I needed an analog of CLISP’s <code>os:file-info</code>. SBCL’s <code>sb-posix:stat</code> was broken for Unicode filenames under Windows. After looking at the sb-posix source code and Microsoft’s Windows API docs, I discovered that it needs to call the function _wstat instead of _stat. So I did this:</p>
<pre>(sb-posix::define-stat-call "_wstat"
  sb-posix::pathname sb-posix::filename
  (function sb-posix::int (sb-posix::c-string :external-format :ucs-2)
            (* sb-posix::alien-stat)))
</pre>
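<p>This defines a function <code>sb-posix::wstat</code>; assuming it behaves like the regular <code>sb-posix:stat</code>, a hypothetical call (the path is made up) might look like this:</p>
<pre>;; Hypothetical usage; the path is just an example
(let ((stat (sb-posix::wstat "D:/日本語/ファイル.txt")))
  (sb-posix:stat-mtime stat))
</pre>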
<p>The new <code>sb-posix::wstat</code> works correctly with Unicode filenames. I filed a bug against SBCL, but to my knowledge it has yet to be resolved.</p>
https://readevalprint.tumblr.com/post/80560377648 Mon, 24 Mar 2014 04:10:45 -0400
Tags: common lisp, lisp, programming, sbcl

Installing Common Lisp (SBCL) on Windows, the guide (work in progress)

<p><script src="https://gist.github.com/tshatrov/6765120.js" type="text/javascript"></script></p>
https://readevalprint.tumblr.com/post/80559559784 Mon, 24 Mar 2014 03:49:28 -0400
Tags: common lisp, lisp, tutorial, programming, sbcl, slime, emacs, asdf