Automatic Indexing

Description

The content-based indexing extension is optional and used only when we want the DB-Adapter to support content-based indexing.

Indexing

Inputs

The content-based indexing extension should be launched as:

~/index/index.pl dbname

where dbname is the name of the database in which the daily cache is available and in which the tables for storing keywords are prepared.

Status of the database entries

The indexing status of each media item is held in the database table status. The status describes whether the item was indexed before and what the result of the indexing was. The descriptions of the status values are held in the codes table in the database and are listed below.

Status of the media
00 Indexed, keywords generated
10 Warning! Indexed, but no keywords generated
20 Error! Not indexed, no result file
21 Error! Not indexed, media file has 0 size
22 Error! Not indexed, media file not downloaded
23 Error! Not indexed, URL not found
24 Error! Not indexed, URL was empty
25 Error! Not indexed, Media format not supported
30 Not yet indexed

The fields in red are not yet implemented or discussed.
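A lookup of these status values can be sketched in Python as follows. The codes and descriptions are copied from the list above; the names STATUS_DESCRIPTIONS, describe_status and is_error are hypothetical and are not part of index.pl, and treating the range 20 to 29 as errors is an assumption based on the "Error!" prefix of codes 20 to 25.

```python
# Status codes and descriptions as listed in the codes table above.
# All names in this sketch are hypothetical, not taken from index.pl.
STATUS_DESCRIPTIONS = {
    "00": "Indexed, keywords generated",
    "10": "Warning! Indexed, but no keywords generated",
    "20": "Error! Not indexed, no result file",
    "21": "Error! Not indexed, media file has 0 size",
    "22": "Error! Not indexed, media file not downloaded",
    "23": "Error! Not indexed, URL not found",
    "24": "Error! Not indexed, URL was empty",
    "25": "Error! Not indexed, Media format not supported",
    "30": "Not yet indexed",
}


def describe_status(code: str) -> str:
    """Return the description for a two-digit status code."""
    return STATUS_DESCRIPTIONS.get(code, "Unknown status code")


def is_error(code: str) -> bool:
    """Assumption: codes 20 to 29 denote errors, as codes 20 to 25 do above."""
    return 20 <= int(code) < 30
```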

Processing

The index.pl script scans the daily cache for artworks that have not yet been indexed and runs the recognition engines (ASR, OCR, face recognition) on them. The indexing is performed in several stages.

1. In the first stage several configuration tasks are performed:

  • A connection is established with the appropriate database
  • A home directory is set
  • A stoplist is loaded

2. From the second stage on, the index.pl script works in a loop, performing the following actions for each entry in the database.

3. In the third stage it is checked whether the database entry has a status entry in the status table. If a database entry has no status, its status is set to asr,ocr,face=30,30,30.

4. In the fourth stage it is checked whether the database entry requires indexing. If {asr,ocr,face}>20, then indexing is required.

5. In the fifth stage a directory structure is prepared.

6. In the sixth stage a URL is obtained from the database. If there is no URL in the database, a status is set and the procedure returns to stage 2. If the URL exists, the file behind it is downloaded with the uget script.

7. In the seventh stage it is checked whether the file was downloaded. If not, the status is updated and the procedure returns to stage 2.

8. In the eighth stage the frame rate (FPS) of the file is checked with mplayer. If the detected FPS equals 0, it is mapped to 1 (as required by the recognition algorithms).

9. In the ninth stage it is checked which media has to be extracted from the downloaded file: frames or the soundtrack.

10. In the tenth stage the appropriate extraction is performed with the usplit script, according to the information from stage 9.

11. In the eleventh stage it is checked whether the extraction was successful. If not, an appropriate status is set. If the extraction was successful, the appropriate recognition mechanisms are launched.

12. In the twelfth stage the keywords from recognition are processed. The keywords are filtered through a stoplist and put into the database.

13. The procedure returns to stage 2.
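The decision points in the stages above can be sketched in Python. This is only an illustration, not the index.pl code: all function names are hypothetical, the rule "greater than 20" is applied to each engine separately (an assumption, since the text does not say whether one or all values must exceed 20), and the stoplist comparison is assumed to be case-insensitive.

```python
NOT_YET_INDEXED = 30  # status value "Not yet indexed"


def default_status():
    """Stage 3: status given to an entry without one (asr,ocr,face=30,30,30)."""
    return {"asr": NOT_YET_INDEXED, "ocr": NOT_YET_INDEXED, "face": NOT_YET_INDEXED}


def engines_to_run(status):
    """Stage 4: engines whose status value is greater than 20.

    Assumption: the check is made per engine.
    """
    return [engine for engine, code in status.items() if code > 20]


def normalise_fps(fps):
    """Stage 8: a detected frame rate of 0 is mapped to 1."""
    return 1 if fps == 0 else fps


def filter_keywords(keywords, stoplist):
    """Stage 12: drop recognised keywords that appear in the stoplist.

    Assumption: the comparison ignores case.
    """
    stop = {word.lower() for word in stoplist}
    return [kw for kw in keywords if kw.lower() not in stop]
```

For example, an entry with status asr,ocr,face=0,21,10 would only be passed to the OCR engine under these assumptions.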


Indexing Engines

The indexing engines have a common interface and are run from within the index.pl script. The common format is as follows:

index.pl $engine/bin/$engine $datasource $resultfile $fps

Example: index.pl ocr/bin/ocr ./tmp/index/frames ./data/recognition 25

The engine executables expect only two or three parameters. The first parameter is a Wave file (with path) or the path to the extracted frames. The second parameter is the output file (with path). The third parameter, given only when the first parameter is a path to extracted frames, is the frame rate (fps).
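The way the engine part of such a command line is put together can be sketched in Python. The function name engine_command is hypothetical, and the sketch only builds the argument list for the engine executable ($engine/bin/$engine and its parameters); it does not reproduce how index.pl actually launches the engines.

```python
def engine_command(engine, datasource, resultfile, fps=None):
    """Build the argument list for an engine executable.

    The fps argument is appended only when given, i.e. when datasource is
    a path to extracted frames rather than a Wave file.
    """
    command = [f"{engine}/bin/{engine}", datasource, resultfile]
    if fps is not None:
        command.append(str(fps))
    return command
```

With the values from the example above, engine_command("ocr", "./tmp/index/frames", "./data/recognition", 25) yields the four elements ocr/bin/ocr, ./tmp/index/frames, ./data/recognition and 25.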

Outputs

The storage for content-based keywords is located in the following two SQL tables:

  • keywords
  • timecodes

Both tables belong to the DB-Adapter's database, and a relation is defined between them.

Table keywords

id  keyword     oasisid
1   black       oasisid1
2   spiral      oasisid1
3   black       oasisid2
4   bhack       oasisid3
5   macarthtur  oasisid3

Table timecodes

id2  timecode
1    2040
1    8840
1    20476
2    40
3    845
3    23476
4    440
4    842
4    10876
5    1023

Relations

The keywords and timecodes tables are related to each other. The id2 column of the timecodes table is a foreign key that refers to the id column of the keywords table.
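The relation can be tried out with a small Python sketch. It uses an in-memory SQLite database purely as a stand-in for the DB-Adapter's PostgreSQL database; the column types are assumptions, and the sample rows follow the two tables above.

```python
import sqlite3

# In-memory SQLite database used only for illustration; column types are assumed.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE keywords (id INTEGER PRIMARY KEY, keyword TEXT, oasisid TEXT);
CREATE TABLE timecodes (id2 INTEGER REFERENCES keywords(id), timecode INTEGER);
""")
conn.executemany("INSERT INTO keywords VALUES (?, ?, ?)", [
    (1, "black", "oasisid1"), (2, "spiral", "oasisid1"), (3, "black", "oasisid2"),
    (4, "bhack", "oasisid3"), (5, "macarthtur", "oasisid3"),
])
conn.executemany("INSERT INTO timecodes VALUES (?, ?)", [
    (1, 2040), (1, 8840), (1, 20476), (2, 40), (3, 845),
    (3, 23476), (4, 440), (4, 842), (4, 10876), (5, 1023),
])

# Follow the id -> id2 relation to list every time code of an exact keyword.
rows = conn.execute("""
SELECT k.oasisid, t.timecode
FROM keywords AS k
JOIN timecodes AS t ON t.id2 = k.id
WHERE k.keyword = ?
ORDER BY k.id, t.timecode
""", ("black",)).fetchall()
```

Here rows holds one (oasisid, timecode) pair for each time code at which the keyword "black" was recognised.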

Searching

Inputs

The searching script should be launched as:

~/index/is/query.php dbname tablename keyword-1 keyword-2 ... keyword-n

where dbname and tablename specify where to store the results, and keyword-1, keyword-2, ..., keyword-n specify the query keywords.

Processing

Every search for recognition-based keywords uses fuzzy-logic pattern matching, so the standard SQL LIKE operator cannot be used. The search is executed on the keywords table first. First, keywords of a length similar to that of the requested pattern are selected from the keywords table. Then a custom PostgreSQL function extracts, from this group of previously selected keywords, the keywords that fuzzily match the requested pattern. Next, using the SQL relations, the OASIS-IDs and time codes matching the extracted keywords are selected. The list is complemented with weights (always integer values), each being the sum of a distance (the fuzzy-matching result) and the number of occurrences of keywords in each row.
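The search procedure can be illustrated with a Python sketch. It is not the custom PostgreSQL function: the Levenshtein edit distance is swapped in as one possible fuzzy-matching measure, the thresholds max_len_diff and max_distance are hypothetical, and "number of occurrences" is assumed to mean the number of time codes stored for a keyword row.

```python
def levenshtein(a, b):
    """Edit distance between two strings (assumed fuzzy-matching measure)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def fuzzy_search(pattern, rows, max_len_diff=1, max_distance=1):
    """rows holds (keyword, oasisid, occurrences) tuples.

    Step 1: keep keywords of similar length.
    Step 2: keep keywords within max_distance of the pattern.
    Step 3: weight = distance + occurrences (an integer).
    """
    results = []
    for keyword, oasisid, occurrences in rows:
        if abs(len(keyword) - len(pattern)) > max_len_diff:
            continue
        distance = levenshtein(pattern, keyword)
        if distance <= max_distance:
            results.append((oasisid, keyword, distance + occurrences))
    return results


# Sample rows: keyword, OASIS-ID, number of time codes for that keyword row.
SAMPLE_ROWS = [
    ("black", "oasisid1", 3), ("spiral", "oasisid1", 1), ("black", "oasisid2", 2),
    ("bhack", "oasisid3", 3), ("macarthtur", "oasisid3", 1),
]
matches = fuzzy_search("black", SAMPLE_ROWS)
```

Under these assumptions the misrecognised keyword "bhack" is still found for the query "black", because it differs from the pattern by a single substitution.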

Outputs

Requirements

The server that provides an account for the content-based indexing extension should have the regular Linux packages and the appropriate DB-Adapter installed, as well as the following packages:

  1. Perl
  2. Wget
  3. OpenSSH
  4. Expect
  5. Mplayer with avisynth.dll, AVIFIL32.dll, DevIL.dll, GDI32.dll, MSACM32.dll

ASR

The ASR requires a separate Windows XP SP2 server with:

  1. Microsoft SAPI (Microsoft Speech SDK 5.1)
  2. Cygwin – SSH server and RSA keys authentication
  3. PHP 5.0.4
  4. Mplayer with codecs

FR

Linux packages:

  1. ImageMagick

OCR

Linux packages:

  1. ImageMagick
  2. PerlMagick
  3. GOCR

Adaptation Tools

Uget

Command

/bin/uget/uget.pl URL output-document

Options

URL - link to resource (http, sftp, rtsp) with optional login elements
output-document - name of local file for selected URL resource

Example

/bin/uget/uget.pl URL

Uget opens a connection to the selected resource (URL) and downloads the corresponding media file to the selected local file (output-document).

Usplit

Command

/bin/usplit/usplit.sh mediafile -v/-a

Options

mediafile - GIF, JPEG, PNG, TIFF, PPM, PDF or videofile supported by mplayer (AVI, MOV, MPG, RM)
-v - split to frames
-a - split to audio (valid only for video files)

Example

/bin/usplit/usplit.sh test1.avi -v

Usplit splits a video or image file into frames for later visual processing (ocr, fr). For video files it is also possible to extract the audio track for audio processing (asr).
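How the split option might be chosen for a given engine can be sketched in Python. The mapping follows the description above (frames for ocr and fr, audio for asr); the names SPLIT_FLAG and usplit_command are hypothetical, and the sketch only builds the command line without running it.

```python
USPLIT = "/bin/usplit/usplit.sh"

# Frames (-v) for the visual engines, audio (-a) for speech recognition.
SPLIT_FLAG = {"ocr": "-v", "fr": "-v", "asr": "-a"}


def usplit_command(mediafile, engine):
    """Return the usplit command line for the given media file and engine."""
    return [USPLIT, mediafile, SPLIT_FLAG[engine]]
```

For example, usplit_command("test1.avi", "ocr") corresponds to the command shown in the example above.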

Ucheck

Command

/bin/ucheck/ucheck.sh

Options

(none)

Ucheck is a quick test of whether the required packages are present in the system. If any problem is found (a package is missing or is present only in an obsolete version), the error level is set. A corresponding log is also shown on every run.
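A comparable presence check can be sketched in Python. This is not the ucheck.sh script: the executable names in REQUIRED are assumptions derived from the package list in the Requirements section, and the sketch does not check versions.

```python
import shutil

# Assumed executable names for Perl, Wget, OpenSSH, Expect and Mplayer.
REQUIRED = ["perl", "wget", "ssh", "expect", "mplayer"]


def missing_tools(names, which=shutil.which):
    """Return the names that cannot be found on the PATH."""
    return [name for name in names if which(name) is None]


def exit_status(names=REQUIRED, which=shutil.which):
    """Mimic an error level: 0 if everything is present, 1 otherwise."""
    return 1 if missing_tools(names, which) else 0
```

The which parameter exists only so that the lookup can be replaced in a test; by default the standard shutil.which is used.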


Location

The files for the content-based indexing extension should be placed in ~/index. The directory structure is organised as follows:

  1. ~/index/asr - Automatic Speech Recognition system files
  2. ~/index/fr - Face Recognition system files
  3. ~/index/ocr - Optical Character (text) Recognition system files
  4. ~/index/tmp - link/directory for temporary data
  5. ~/index/data/recognition - link/directory for recognition results
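Preparing this layout can be sketched in Python. The function name prepare_layout is hypothetical, and the sketch creates plain directories only; where tmp or data/recognition already exist as links to directories, they are left untouched.

```python
from pathlib import Path

# Subdirectories of ~/index as listed above.
SUBDIRS = ["asr", "fr", "ocr", "tmp", "data/recognition"]


def prepare_layout(root):
    """Create the missing subdirectories below root and return their paths."""
    root = Path(root)
    paths = [root / sub for sub in SUBDIRS]
    for path in paths:
        # exist_ok keeps existing directories (or links to directories) as they are.
        path.mkdir(parents=True, exist_ok=True)
    return paths
```

A call such as prepare_layout(Path.home() / "index") would set up the structure below ~/index.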

Installation

  Login