Reading Japanese text from images

May 10, 2014 – Filed as: Offtopic⁶

I’ve been intensively studying Japanese recently¹, and something I have to do quite a lot is look up Japanese words that I see in images, particularly words that contain a lot of kanji characters.

Yesterday evening I hacked together a program that lets me select arbitrary Japanese text from an image on my screen, performs optical character recognition (OCR) on it, and outputs the result as regular Unicode text. This text can then be looked up in an online Japanese-English dictionary, such as the excellent jisho.org.

Let’s take the following sentence:

[Image: Japanese sentence with lots of kanji]

The first word in that sentence alone has four kanji characters, which would normally take a long time to look up individually.²

With my program, I can take a screenshot of this sentence as it appears on my screen and the program will immediately output the following text:

痴話喧嘩は維続中か

Then I can look up the words as usual with jisho.org’s Words search:

痴話喧嘩 = lover’s quarrel
維続 = ?
中 = inside; in (a space or building)

Character recognition still isn’t perfect, as demonstrated by the character 維 (marked in pink above), which wasn’t decoded correctly.

Nevertheless, even a partially correct character is useful, as I can still use jisho.org’s Kanji search to break it down into its component radicals:

維 = 小 + 幺 + 糸 + 隹

And then use a few of the correct radicals in the Kanji by Radicals search to quickly find the correct character:

継 = 小 + 幺 + 糸 + ｜ + 米

And so the original word can now be looked up correctly:

継続 = continuation
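
The radical-based correction above amounts to a set lookup: keep the radicals that look right, then find the kanji that contain all of them. Here is a minimal Python 3 sketch of that idea. The two-entry `RADICALS` table is copied from the breakdowns above and the function name `kanji_with_radicals` is hypothetical; this does not query jisho.org, whose real index covers far more characters.

```python
# Radical breakdowns copied from the worked example above.
# A real lookup table would cover thousands of kanji.
RADICALS = {
    '維': {'小', '幺', '糸', '隹'},
    '継': {'小', '幺', '糸', '｜', '米'},
}


def kanji_with_radicals(wanted, table=RADICALS):
    """Return every kanji in the table that contains all wanted radicals."""
    return [kanji for kanji, parts in table.items() if wanted <= parts]


# Radicals that the misread character shares with the correct one.
shared = RADICALS['維'] & RADICALS['継']

# Adding one radical that only the correct character has narrows the search.
candidates = kanji_with_radicals({'糸', '米'})
```

The three shared radicals explain why the OCR confused the two characters, and why picking just a few correct radicals is enough to find the right one.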

How it Works

The program waits for you to take a screenshot using Command-Shift-4 (on the Mac), notices the new screenshot file appearing on the desktop, converts it to PPM format with ImageMagick, and feeds it to the nhocr tool, which reads the Japanese text inside the screenshot and prints it.

Source Code

#!/usr/bin/env python
# 
# nhocr_desktop.py
#
# Waits for screenshots to appear on the desktop,
# scans them for Japanese text using OCR,
# and prints out the text that was decoded from the image.
# 
# This script is useful for rapidly converting on-screen Japanese
# text to actual text that can be looked up.
#
# Prerequisites:
#   * nhocr
#   * ImageMagick
#   * Python 2.7 - maybe 2.6 okay too
#   * watchdog
# 
# @author David Foster
# 

import os.path
import subprocess
import time
from watchdog.observers import Observer
from watchdog.events import FileSystemEventHandler

class ScreenshotFileCreationEventHandler(FileSystemEventHandler):
    def on_created(self, event):
        if not event.is_directory:
            if 'Screen Shot' in os.path.basename(event.src_path):
                # Convert screenshot to PPM format with ImageMagick
                subprocess.check_call([
                    'convert', event.src_path, '/tmp/line.ppm'])

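                # Recognize the text with nhocr and capture its output
                line = subprocess.check_output([
                    'nhocr', '-line', '/tmp/line.ppm', '-o', '-'])
                # Trailing comma suppresses print's own newline
                print line.decode('utf-8'),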

event_handler = ScreenshotFileCreationEventHandler()

observer = Observer()
observer.schedule(event_handler, os.path.expanduser('~/Desktop'), recursive=False)
observer.start()

print 'Waiting for screenshots on the desktop to analyze...'
try:
    while True:
        time.sleep(1)
except KeyboardInterrupt:
    observer.stop()
observer.join()
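
The listing above targets Python 2.7. As a rough sketch only, the OCR step could be written for Python 3 along the following lines, assuming `convert` (ImageMagick) and `nhocr` are both on the PATH. The helper names `is_screenshot`, `ocr_commands`, and `ocr_screenshot` are hypothetical, and a temporary directory stands in for the fixed `/tmp/line.ppm` path.

```python
import os.path
import subprocess
import tempfile


def is_screenshot(path):
    """Return True if the file name looks like a macOS screenshot."""
    return 'Screen Shot' in os.path.basename(path)


def ocr_commands(image_path, ppm_path):
    """Build the same two commands the script runs: convert, then nhocr."""
    return [
        ['convert', image_path, ppm_path],
        ['nhocr', '-line', ppm_path, '-o', '-'],
    ]


def ocr_screenshot(image_path):
    """Convert the image to PPM, run nhocr on it, and return the text."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        ppm_path = os.path.join(tmp_dir, 'line.ppm')
        convert_cmd, nhocr_cmd = ocr_commands(image_path, ppm_path)
        subprocess.check_call(convert_cmd)
        return subprocess.check_output(nhocr_cmd).decode('utf-8')
```

Calling `ocr_screenshot(event.src_path)` from the `on_created` handler whenever `is_screenshot(event.src_path)` is true would reproduce the behavior of the original handler.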

¹ All this studying is one reason I haven’t posted in a while.
² Individual kanji can be looked up by their component radicals at jisho.org’s Kanji by Radicals search. Unfortunately it can take a few minutes to look up each character.