----------------------------------------------------------------
                        WeOCR Server API

                                      Rev.100131  Jan. 31, 2010
                                      Rev.110903  Sept. 3, 2011
                                      Rev.110907  Sept. 7, 2011
----------------------------------------------------------------

1. Introduction to WeOCR API

WeOCR API has been designed to give a wide range of applications
a very simple means of accessing online OCR services.

WeOCR API uses the HTTP GET and POST methods. Neither SOAP nor
any other special protocol is used.

2. Non-interactive use of WeOCR server

WeOCR servers can be used non-interactively as well as via an
HTML page. To do so, call the CGI program directly using a web
tool such as cURL.

Example:

  $ curl -F userfile=@Image_File_Name \
         -F outputencoding="utf-8" \
         -F outputformat="txt" \
         http://Server_Address/cgi-bin/weocr/submit.cgi >result.txt

You will probably need the --max-time and --connect-timeout
options as well, so that the process does not block indefinitely.

If you specify outputformat="txt", the first line of the output
data is used as the status line. The first line will be blank
upon successful recognition.

3. Locating the server

To find a WeOCR server suitable for your application, visit

  http://weocr.ocrgrid.org/

(There may be some other WeOCR search engine sites.)

Every WeOCR web site has a server spec file "srvspec.xml" in the
site top directory. The CGI program can be located by looking at
the following entry in the spec file.

  <ocrserver specversion="1.x">
    <svinfo>
      <cgi> ... </cgi>

4. Parameters

WeOCR servers accept the following parameters. If a parameter is
not specified, its default value is used.

outputencoding: Output Encoding {utf-8, ...}   (default: utf-8)
    specifies the encoding of the output data. For example,
    "utf-8" and "iso-8859-15" specify UTF-8 and Latin9
    (ISO-8859-15), respectively. The server must support at
    least UTF-8.

contentlang: Contents Language {eng, deu, ...}   (default: auto)
    specifies the language used in the document image in case
    the server supports multiple languages. The special value
    "auto" lets the server either assume a language it supports
    or detect the language(s) automatically. The language code
    is based on ISO 639-3, although some servers may accept
    derivatives of it.

outputformat: Output Format {html, txt}   (default: txt)
    specifies the format of the output data. If the output
    format is "txt", the first line of the text data is used as
    the status line. A blank line represents "no error".

eclass: Element Class {auto, page, text_block, text_line, word,
        character}   (default: auto)
    specifies the element class. If "page" is given, the server
    assumes that the input data contains a page image. If "word"
    is given, the server assumes that the input data contains a
    single word. If "auto" is given, the server automatically
    detects the element class or just performs OCR in page mode.
    "auto" covers every class except "character".

    "character" specifies multi-character recognition mode. If
    this mode is selected, the server recognizes each character
    image independently without applying character segmentation.
    See mchar-file-format.txt for details about the input/output
    data format.

ntop: The number of top candidates   (default: server specific value)
    specifies the maximum number of top character candidates in
    multi-character recognition mode. Note that the server may
    produce fewer candidates depending on the results of
    character recognition. Some servers may simply ignore this
    parameter and produce more candidates. The client must read
    all the candidate data and discard what it does not need.

5. Processing time

The processing time varies depending on the server hardware,
server load, OCR engine, image size, etc. A US Letter / A4 page
image may require more than 60 seconds. Unless otherwise
mentioned, the processing time is limited to 120 seconds on the
server side.
The client program does not need to wait longer than this.

Sending single character images to the server one by one should
be avoided since it is quite inefficient. In "character" mode,
packing several character images into one file is recommended.

--
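
The non-interactive call from section 2 and the status line check
for outputformat="txt" can be combined in a small shell sketch.
This is only an illustration, not part of the WeOCR distribution:
the function names weocr_submit and weocr_check are hypothetical,
the 10 second connect timeout is an arbitrary choice, and the
120 second overall timeout simply mirrors the server-side limit
mentioned in section 5.

```shell
#!/bin/sh
# Hypothetical helpers; the names and timeout values are assumptions.

# weocr_submit CGI_URL IMAGE_FILE RESULT_FILE
# Posts the image as "userfile" and stores the raw reply in RESULT_FILE.
weocr_submit() {
    curl --silent --show-error \
         --connect-timeout 10 --max-time 120 \
         -F "userfile=@$2" \
         -F "outputencoding=utf-8" \
         -F "outputformat=txt" \
         "$1" >"$3"
}

# weocr_check RESULT_FILE
# The first line of a "txt" reply is the status line.
# Blank status: print the remaining lines (the recognized text).
# Non-blank status: print it to stderr and return 1.
weocr_check() {
    status_line=$(head -n 1 "$1" | tr -d '\r')
    if [ -n "$status_line" ]; then
        printf 'WeOCR error: %s\n' "$status_line" >&2
        return 1
    fi
    tail -n +2 "$1"
}
```

A typical call might look like

  weocr_submit http://Server_Address/cgi-bin/weocr/submit.cgi \
               Image_File_Name result.txt && weocr_check result.txt

where Server_Address and Image_File_Name are the same placeholders
as in the example of section 2.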