Automatic Indexing
Description
The content-based indexing extension is optional. It is needed only when the DB-Adapter should support content-based indexing.
Indexing
Inputs
The content-based indexing extension should be launched as:
~/index/index.pl dbname
Here dbname is the name of the database in which the daily cache is available and in which the tables for storing keywords have been prepared.
Status of the database entries
The indexing status of the media is held in the database table status. The status of a media item describes whether it was indexed before and what the result of the indexing was. The descriptions of the status values are held in the codes table in the database and are listed below.
Status of the media
| Code | Meaning |
| 00 | Indexed, keywords generated |
| 10 | Warning! Indexed, but no keywords generated |
| 20 | Error! Not indexed, no result file |
| 21 | Error! Not indexed, media file has 0 size |
| 22 | Error! Not indexed, media file not downloaded |
| 23 | Error! Not indexed, URL not found |
| 24 | Error! Not indexed, URL was empty |
| 25 | Error! Not indexed, Media format not supported |
| 30 | Not yet indexed |
The fields in red are not yet implemented or discussed.
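As an illustration only, the status codes listed above can be held in a small lookup table. The sketch below is written in Python, not in the Perl of index.pl, and the names STATUS_CODES and describe_status are hypothetical; only the codes and their descriptions come from the table.

```python
# Status codes and descriptions copied from the "Status of the media" table.
# STATUS_CODES and describe_status are hypothetical names used for illustration.
STATUS_CODES = {
    0: "Indexed, keywords generated",
    10: "Warning! Indexed, but no keywords generated",
    20: "Error! Not indexed, no result file",
    21: "Error! Not indexed, media file has 0 size",
    22: "Error! Not indexed, media file not downloaded",
    23: "Error! Not indexed, URL not found",
    24: "Error! Not indexed, URL was empty",
    25: "Error! Not indexed, Media format not supported",
    30: "Not yet indexed",
}


def describe_status(code: int) -> str:
    """Return the description of a status code, or a placeholder if unknown."""
    return STATUS_CODES.get(code, "Unknown status code")
```

A call such as describe_status(24) returns the description for an empty URL.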
Processing
The index.pl script analyses the daily cache for not-yet-indexed artworks and executes the subordinate recognition engines. The indexing is performed in several stages.
1. In the first stage, several configuration tasks are performed:
- A connection is established with the appropriate database
- A home directory is set
- A stoplist is loaded
2. From the second stage on, the index.pl script works in a loop, performing the following actions for each entry in the database.
3. In the third stage, it is checked whether the database entry has a status entry in the status table.
If a database entry has no status, its status is set to asr,ocr,face=30,30,30
4. In the fourth stage, it is checked whether the database entry requires indexing. If {asr,ocr,face}>20, then indexing is required.
5. In the fifth stage, a directory structure is prepared.
6. In the sixth stage, a URL is obtained from the database. If there is no URL in the database, a status is set and the procedure returns to stage 2. If the URL exists, the file behind it is downloaded using the uget script.
7. In the seventh stage, it is checked whether the file was downloaded. If not, the status is updated and the procedure returns to stage 2.
8. In the eighth stage, the FPS of the file is checked using mplayer. If the detected FPS equals 0, it is mapped to 1 (as required by the recognition algorithms).
9. In the ninth stage, it is determined which media has to be extracted from the downloaded file: frames or soundtrack.
10. In the tenth stage, the appropriate extraction is performed using the usplit script, according to the information from stage 9.
11. In the eleventh stage, it is checked whether the extraction was successful. If not, an appropriate status is set. If the extraction was successful, the appropriate recognition mechanisms are launched.
12. In the twelfth stage, the keywords from the recognition are processed. The keywords are filtered through the stoplist and stored in the database.
13. The procedure returns to stage 2.
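The decisions made in stages 3, 4, 8 and 12 can be sketched as a few small functions. This is a Python illustration under stated assumptions, not the actual Perl code of index.pl: the function names are hypothetical, the rule "{asr,ocr,face}>20" is read here as "any of the three values is greater than 20", and stoplist matching is assumed to be case-insensitive.

```python
from typing import Dict, Iterable, List


def default_status() -> Dict[str, int]:
    """Stage 3: status assigned to an entry that has no status yet."""
    return {"asr": 30, "ocr": 30, "face": 30}


def requires_indexing(status: Dict[str, int]) -> bool:
    """Stage 4: assumption, indexing is required if ANY engine status is > 20."""
    return any(code > 20 for code in status.values())


def effective_fps(measured_fps: float) -> float:
    """Stage 8: an FPS of 0 is mapped to 1 for the recognition algorithms."""
    return 1 if measured_fps == 0 else measured_fps


def filter_keywords(keywords: Iterable[str], stoplist: Iterable[str]) -> List[str]:
    """Stage 12: drop keywords found in the stoplist (case-insensitive, assumed)."""
    stop = {word.lower() for word in stoplist}
    return [kw for kw in keywords if kw.lower() not in stop]
```

With these helpers, an entry without a status starts at 30 for every engine and is therefore picked up for indexing on the next pass of the loop.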
The executables of the recognition engines expect only two or three parameters. The first parameter is the Wave file (with path) or the path to the extracted frames. The second parameter is the output file (with path). The third parameter, used only when the first parameter is a path to extracted frames, is the FPS.
Indexing Engines
The indexing engines have a common interface and are run from within the index.pl script.
The common format is as follows:
index.pl $engine/bin/$engine $datasource $resultfile $fps
Example:
index.pl ocr/bin/ocr ./tmp/index/frames ./data/recognition 25
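The parameter convention described above (two parameters for audio input, three for frames) can be illustrated with a small helper that assembles the engine's argument list. This is a Python sketch with a hypothetical function name, engine_command; how index.pl actually builds and runs the command is not shown in this document.

```python
from typing import List, Optional


def engine_command(engine: str, datasource: str, resultfile: str,
                   fps: Optional[int] = None) -> List[str]:
    """Build the argument list for a recognition engine.

    The executable path follows the $engine/bin/$engine pattern. The fps
    value is appended only when given, i.e. when datasource is a path to
    extracted frames rather than a Wave file.
    """
    args = [f"{engine}/bin/{engine}", datasource, resultfile]
    if fps is not None:
        args.append(str(fps))
    return args
```

For the OCR example, engine_command("ocr", "./tmp/index/frames", "./data/recognition", 25) yields the four elements shown in the example line after index.pl.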
Outputs
Content-based keywords are stored in the following two SQL tables of the DB-Adapter's database. A relation is defined between these two tables.
Table keywords
| id | keyword | oasisid |
| 1 | black | oasisid1 |
| 2 | spiral | oasisid1 |
| 3 | black | oasisid2 |
| 4 | bhack | oasisid3 |
| 5 | macarthtur | oasisid3 |
Table timecodes
| id2 | timecode |
| 1 | 2040 |
| 1 | 8840 |
| 1 | 20476 |
| 2 | 40 |
| 3 | 845 |
| 3 | 23476 |
| 4 | 440 |
| 4 | 842 |
| 4 | 10876 |
| 5 | 1023 |
Relations
The keywords and timecodes tables are related to each other. Each value in the id2 column of the timecodes table is a foreign key referencing the id column of the keywords table.
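The relation between the two tables can be tried out with a small in-memory database. The sketch below uses Python with SQLite purely for illustration (the DB-Adapter itself uses PostgreSQL), the column types are assumptions, and only the first rows of the sample data are loaded.

```python
import sqlite3

# In-memory SQLite database standing in for the DB-Adapter's database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE keywords (id INTEGER PRIMARY KEY, keyword TEXT, oasisid TEXT)"
)
conn.execute(
    "CREATE TABLE timecodes (id2 INTEGER REFERENCES keywords(id), timecode INTEGER)"
)
conn.executemany(
    "INSERT INTO keywords VALUES (?, ?, ?)",
    [(1, "black", "oasisid1"), (2, "spiral", "oasisid1"), (3, "black", "oasisid2")],
)
conn.executemany(
    "INSERT INTO timecodes VALUES (?, ?)",
    [(1, 2040), (1, 8840), (1, 20476), (2, 40), (3, 845), (3, 23476)],
)


def timecodes_for(keyword):
    """Return (oasisid, timecode) pairs for an exactly matching keyword."""
    return conn.execute(
        "SELECT k.oasisid, t.timecode "
        "FROM keywords k JOIN timecodes t ON t.id2 = k.id "
        "WHERE k.keyword = ? ORDER BY k.oasisid, t.timecode",
        (keyword,),
    ).fetchall()
```

Joining on t.id2 = k.id is what turns a keyword hit into the list of OASIS-IDs and time-codes at which it occurs.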
Searching
Inputs
The searching script should be launched as:
~/index/is/query.php dbname tablename keyword-1 keyword-2 ... keyword-n
Here dbname and tablename specify where to store the results, and keyword-1, keyword-2, ..., keyword-n specify the query keywords.
Processing
Every search for recognition-based keywords uses fuzzy pattern matching, so the standard SQL LIKE operator cannot be used. The search is executed on the keywords table first. Initially, keywords of similar length to the requested pattern are selected from the keywords table. Then a custom PostgreSQL function extracts, from this group of previously selected keywords, those that fuzzily match the requested pattern. Next, using the SQL relations, the OASIS-IDs and time-codes matching the extracted keywords are selected. The list is complemented with weights (always integer values), each being the sum of a distance (the fuzzy-matching result) and the number of occurrences of the keyword in each row.
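The two-step selection (length prefilter, then fuzzy match) and the weight computation can be sketched as follows. This is a Python illustration under loud assumptions: the custom PostgreSQL function is not documented here, so Levenshtein edit distance is swapped in as the fuzzy measure, and the thresholds max_len_diff and max_distance as well as all function names are hypothetical.

```python
from typing import List, Tuple


def levenshtein(a: str, b: str) -> int:
    """Classic edit distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def fuzzy_search(pattern: str, rows: List[Tuple[str, int]],
                 max_len_diff: int = 1, max_distance: int = 1) -> List[Tuple[str, int]]:
    """rows holds (keyword, occurrences); returns (keyword, weight) pairs."""
    results = []
    for keyword, occurrences in rows:
        # Step 1: keep only keywords of similar length.
        if abs(len(keyword) - len(pattern)) > max_len_diff:
            continue
        # Step 2: fuzzy match against the requested pattern.
        distance = levenshtein(pattern, keyword)
        if distance <= max_distance:
            # Weight: sum of the distance and the number of occurrences.
            results.append((keyword, distance + occurrences))
    return results
```

With this measure, a misrecognised keyword such as bhack still matches a query for black at distance 1, which is the case the fuzzy search is meant to cover.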
Outputs
Requirements
The server hosting the account for the content-based indexing extension should have the regular Linux packages and the appropriate DB-Adapter installed, as well as the following packages:
- Perl
- Wget
- OpenSSH
- Expect
- Mplayer with avisynth.dll, AVIFIL32.dll, DevIL.dll, GDI32.dll, MSACM32.dll
ASR
The ASR requires a separate Windows XP SP2 server with:
- Microsoft SAPI (Microsoft Speech SDK 5.1)
- Cygwin – SSH server and RSA keys authentication
- PHP 5.0.4
- Mplayer with codecs
FR
Linux packages:
- ImageMagick
OCR
Linux packages:
- ImageMagick
- PerlMagick
- GOCR
Adaptation Tools
Uget
Command
- /bin/uget/uget.pl URL output-document
Options
- URL - link to resource (http, sftp, rtsp) with optional login elements
- output-document - name of local file for selected URL resource
Example
- /bin/uget/uget.pl URL
Uget opens a connection to the selected resource (URL) and downloads the corresponding media file to the selected local file (output-document).
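Before uget is called, a caller may want to reject empty URLs and unsupported schemes early, which corresponds to the "URL was empty" status. The following Python sketch shows one way to do that. The function name uget_command and the error messages are hypothetical, and the check itself is an assumption rather than documented uget behaviour.

```python
from typing import List
from urllib.parse import urlsplit

# Schemes listed in the Uget options above.
SUPPORTED_SCHEMES = {"http", "sftp", "rtsp"}


def uget_command(url: str, output_document: str) -> List[str]:
    """Validate the URL scheme and build the uget argument list."""
    if not url:
        raise ValueError("URL was empty")
    scheme = urlsplit(url).scheme.lower()
    if scheme not in SUPPORTED_SCHEMES:
        raise ValueError(f"unsupported URL scheme: {scheme!r}")
    return ["/bin/uget/uget.pl", url, output_document]
```

The returned list can be passed to a process launcher such as subprocess.run without going through a shell, so login elements in the URL need no extra quoting.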
Usplit
Command
- /bin/usplit/usplit.sh mediafile -v/-a
Options
- mediafile - GIF, JPEG, PNG, TIFF, PPM, PDF or videofile supported by mplayer (AVI, MOV, MPG, RM)
- -v - split to frames
- -a - split to audio (valid only for video files)
Example
- /bin/usplit/usplit.sh test1.avi -v
Usplit splits a video or image file into frames for later visual processing (ocr, fr). For video files it is also possible to extract the audio track for audio processing (asr).
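The choice between -v and -a follows from which engine will consume the output. A minimal Python sketch of that mapping is given below. The names USPLIT_MODE and usplit_command are hypothetical, and the engine keys ocr, fr and asr are taken from the description above.

```python
from typing import List

# Visual engines need frames (-v); speech recognition needs the audio track (-a).
USPLIT_MODE = {"ocr": "-v", "fr": "-v", "asr": "-a"}


def usplit_command(mediafile: str, engine: str) -> List[str]:
    """Build the usplit argument list for the given recognition engine."""
    if engine not in USPLIT_MODE:
        raise ValueError(f"unknown engine: {engine!r}")
    return ["/bin/usplit/usplit.sh", mediafile, USPLIT_MODE[engine]]
```

Note that -a is valid only for video files, so a caller would still need to skip asr for still images.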
Ucheck
Command
- /bin/ucheck/ucheck.sh
Options
- (none)
Ucheck is a quick test of whether the required packages are present in the system. If any problem is found (a package is missing or only an obsolete version is installed), the error level is set. A corresponding log is also shown on every run.
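The idea behind such a check can be sketched in a few lines of Python. This is not the ucheck.sh script itself: the command names in REQUIRED_COMMANDS are assumptions about how the required packages appear on the PATH, and version checks are left out.

```python
import shutil
from typing import Iterable, List

# Assumed executable names for the required packages (Perl, Wget, OpenSSH,
# Expect, Mplayer); the real ucheck.sh may test for them differently.
REQUIRED_COMMANDS = ["perl", "wget", "ssh", "expect", "mplayer"]


def missing_commands(commands: Iterable[str]) -> List[str]:
    """Return the commands that cannot be found on the PATH."""
    return [name for name in commands if shutil.which(name) is None]


def check(commands: Iterable[str] = REQUIRED_COMMANDS) -> int:
    """Print a short log and return 0 if everything is present, else 1."""
    missing = missing_commands(commands)
    for name in missing:
        print(f"package not found: {name}")
    if not missing:
        print("all required packages found")
    return 1 if missing else 0
```

The return value of check plays the role of the error level and could be handed to sys.exit by a wrapper script.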
Location
The files for a content-based indexing extension should be placed in ~/index
The directory structure is organised as follows:
- ~/index/asr - Automatic Speech Recognition system files
- ~/index/fr - Face Recognition system files
- ~/index/ocr - Optical Character (text) Recognition system files
- ~/index/tmp - link/directory for temporary data
- ~/index/data/recognition - link/directory for recognition results
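The layout listed above can be created with a short helper. The Python sketch below creates plain directories only. Where the list says "link/directory", a deployment may prefer symbolic links, which this sketch does not handle, and the function name prepare_layout is hypothetical.

```python
from pathlib import Path
from typing import List, Union

# Sub-directories of ~/index as listed above.
SUBDIRS = ["asr", "fr", "ocr", "tmp", "data/recognition"]


def prepare_layout(root: Union[str, Path]) -> List[Path]:
    """Create the directory structure under root and return the paths."""
    root = Path(root).expanduser()
    created = []
    for sub in SUBDIRS:
        path = root / sub
        # parents=True also creates "data"; exist_ok makes re-runs harmless.
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created
```

Calling prepare_layout("~/index") would set up the structure in the home directory of the account running the extension.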
Installation
|