----------------------------------------------------------------
WeOCR Server API
Rev.100131 Jan. 31, 2010
Rev.110903 Sept. 3, 2011
Rev.110907 Sept. 7, 2011
----------------------------------------------------------------
1. Introduction to WeOCR API
The WeOCR API is designed to give applications a very simple
means of accessing online OCR services.
The WeOCR API uses the HTTP GET and POST methods.
Neither SOAP nor any other special protocol is used.
2. Non-interactive use of WeOCR server
WeOCR servers can be used non-interactively as well as via an
HTML page. To do so, you just call the CGI program directly
using a web tool such as cURL.
Example:
$ curl -F userfile=@Image_File_Name \
-F outputencoding="utf-8" \
-F outputformat="txt" \
http://Server_Address/cgi-bin/weocr/submit.cgi >result.txt
You will probably also want to use the --max-time and
--connect-timeout options so that the process does not block
indefinitely.
If you specify outputformat="txt", the first line of the
output data is used as the status line. The first line will be
blank upon successful recognition.
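
The status-line convention above can be checked on the client side
with a few small helpers. This is only a sketch: the function names
weocr_status, weocr_text and weocr_ok are hypothetical, and it assumes
the result was saved to a file with outputformat="txt".

```shell
# Hypothetical helpers for a result file saved with outputformat="txt".

# Print the status line (the first line), without a trailing CR.
weocr_status() {
    head -n 1 "$1" | tr -d '\r'
}

# Print the recognized text (everything after the status line).
weocr_text() {
    tail -n +2 "$1"
}

# Succeed only if the file is non-empty and its status line is blank.
weocr_ok() {
    [ -s "$1" ] && [ -z "$(weocr_status "$1")" ]
}
```

For example, "weocr_ok result.txt && weocr_text result.txt" prints the
recognized text only when the server reported no error. An empty file
(for instance after a failed transfer) is treated as a failure.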
3. Locating the server
To find a WeOCR server suitable for your application, visit
http://weocr.ocrgrid.org/
(Other WeOCR search engine sites may also exist.)
Every WeOCR web site has a server spec file "srvspec.xml"
in the top directory of the site. The CGI program can be located
by looking at the following entry in the spec file:
<ocrserver specversion="1.x">
<svinfo>
<cgi> ... </cgi>
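
As a rough illustration, the <cgi> entry can be pulled out of a spec
file with standard text tools. The function name weocr_cgi_from_spec
is hypothetical, and the sketch assumes that the spec file contains a
single <cgi> element whose value has no embedded whitespace. A real
XML parser would be more robust.

```shell
# Hypothetical helper: read a srvspec.xml document on stdin and print
# the text inside its <cgi> element.
# Assumes one <cgi> element and a value without embedded whitespace.
weocr_cgi_from_spec() {
    # Join all lines first so that the element may span several lines.
    tr '\r\n' '  ' |
        sed -n 's#.*<cgi>[[:space:]]*\([^<[:space:]]*\)[[:space:]]*</cgi>.*#\1#p'
}
```

Assuming the spec file is served from the site top directory, a call
such as "curl -s http://Server_Address/srvspec.xml | weocr_cgi_from_spec"
would print the entry.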
4. Parameters
WeOCR servers accept the following parameters.
The default value is used if the parameter is not specified.
outputencoding: Output Encoding {utf-8, ...} (default: utf-8)
specifies the encoding of the output data.
For example, "utf-8" and "iso-8859-15" specify UTF-8 and
Latin9 (ISO-8859-15), respectively.
At least UTF-8 must be supported by the server.
contentlang: Content Language {eng, deu, ...} (default: auto)
specifies the language used in the document image when the
server supports multiple languages. The special value "auto"
lets the server either assume the language it supports or
detect the language(s) automatically.
Language codes are based on ISO 639-3, although some servers
may accept variants.
outputformat: Output Format {html, txt} (default: txt)
specifies the format of the output data.
If the output format is "txt", the first line of the text
data is used as the status line. A blank line represents
"no error".
eclass: Element Class
{auto, page, text_block, text_line, word, character}
(default: auto)
specifies the element class.
If "page" is given, the server assumes that the input data
contains a page image.
If "word" is given, the server assumes that the input data
contains a single word.
"auto" includes everything except "character".
If "auto" is given, the server automatically detects the
element class or just performs OCR in page mode.
"character" specifies multi-character recognition mode.
If this mode is selected, the server recognizes each
character image independently without applying character
segmentation.
See mchar-file-format.txt for details about the input/output
data format.
ntop: The number of top candidates
(default: server specific value)
specifies the maximum number of the top character candidates
in the multi-character recognition mode.
Note that the server may produce fewer candidates depending
on the results of character recognition.
Some servers may simply ignore this parameter and produce
more candidates. The client must read all of the candidate
data and discard the candidates it does not need.
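
The parameters above can be passed as additional form fields, in the
same way as outputencoding and outputformat in the cURL example of
section 2. The following is a minimal sketch; the function name
weocr_submit and its argument order are hypothetical.

```shell
# Hypothetical wrapper: submit an image with explicit parameters.
#   $1 = image file
#   $2 = URL of the CGI program
#   $3 = contentlang (optional, falls back to "auto")
#   $4 = eclass      (optional, falls back to "auto")
weocr_submit() {
    curl -F "userfile=@$1" \
         -F "outputencoding=utf-8" \
         -F "outputformat=txt" \
         -F "contentlang=${3:-auto}" \
         -F "eclass=${4:-auto}" \
         "$2"
}
```

For example, "weocr_submit line.png http://Server_Address/cgi-bin/weocr/submit.cgi deu text_line >result.txt"
would request recognition of a single German text line.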
5. Processing time
The processing time varies depending on the server hardware,
server load, OCR engine, image size, etc. A US Letter / A4 page
image may require more than 60 seconds.
Unless otherwise mentioned, the processing time is limited to
120 seconds on the server side. The client program does not need
to wait longer than this.
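
A client can enforce a matching limit with cURL's own timeout
options. In the sketch below, the function name weocr_curl and the
10-second connection timeout are assumptions; the 120-second total
limit follows the server-side limit described above. cURL exits with
code 28 when one of its timeouts is reached.

```shell
# Hypothetical wrapper: run curl with client-side time limits and
# pass all other arguments through unchanged.
weocr_curl() {
    curl --silent --show-error \
         --connect-timeout 10 \
         --max-time 120 \
         "$@"
    rc=$?
    # Exit code 28 means a curl timeout was reached.
    if [ "$rc" -eq 28 ]; then
        echo "weocr: request timed out" >&2
    fi
    return "$rc"
}
```

It is called with the same arguments as plain curl, for example
"weocr_curl -F userfile=@Image_File_Name http://Server_Address/cgi-bin/weocr/submit.cgi >result.txt".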
Avoid sending single character images to the server one by one,
since this is quite inefficient. In the "character" mode, packing
several character images into one file is recommended.
--