Textweiser SDK User Manual - Lingua
Contents
1. 4.3.3 Classification Options tw_classify_opt_t
4.3.4 Configuration Data tw_config_t
4.4 Function Reference
4.4.1 tw_add_category and tw_delete_category
4.4.2 tw_backup_db and tw_restore_db
4.4.3 tw_classify, tw_classify_v2, tw_classify_file and tw_classify_file_v2
4.4.4 tw_create_db and tw_erase_db
4.4.5 tw_free
4.4.6 tw_free_categories
4.4.7 tw_free_config_t
4.4.8 tw_free_prob_t
4.4.9 tw_get_categories
4.4.10 tw_init
4.4.11 tw_learn and tw_learn_file
4.4.12 tw_optimize_db
4.4.13 tw_parse_config
4.4.14 tw_rename_category
4.4.15 tw_strerror
4.4.16 tw_unlearn and tw_unlearn_file
4.4.17 tw_version and tw_version_string
4.5 Error Handling
4.5.1 tw_errno_t Named Error Constants
4.6 Hints on Application Development
4.6.1 Determining Textweise
2.
    Values       Description          Comment
    off, no      Disable encryption   Default
    on, yes      Enable encryption
    trust cert   Trust certificate    In addition to on or yes

Figure 10: Valid Values for encrypt

The value trust cert has to follow either on or yes and may be separated from it by a comma and/or whitespace, for example: on, trust cert. For further information on encryption, please refer to chapter 3.4 on page 10.

Example configuration file (Microsoft SQL Server):

    host      dbsrv.local
    user      TAS
    passwd    secret
    db_name   Textweiser
    encrypt   on, trust cert
    instance  SQLEXPRESS
    port      not set => use default

Lingua Systems Textweiser SDK v1.3.0, Page 15

4.3 Important Data Structures

The data structure tw_errno_t is described in a separate chapter on error handling (chapter 4.5, page 26).

4.3.1 Textweiser Object tw_t

The data structure tw_t contains data that is used exclusively by Textweiser internally. No application should evaluate or change this data directly.

First, assign the macro TW_INITIALIZER to any variable of type tw_t on declaration in order to initialize it with its default values. The function tw_init then initializes the tw_t object for use within the operating environment and connects to the database. A tw_t object is expected as an argument by almost every Textweiser function. Use tw_free to free the memory allocated by this object and disconnect from the
3.
    tw_errno_t tw_create_db(const tw_config_t *cfg);
    tw_errno_t tw_erase_db(const tw_config_t *cfg);

The function tw_create_db creates a new Textweiser database and initializes it with all necessary structures. tw_erase_db deletes all data from a Textweiser database.

The functions expect a pointer to a tw_config_t data structure that contains all settings necessary to connect to the database. Details on tw_config_t are given in chapter 4.3.4 on page 17.

The functions return an error code that indicates whether the respective function succeeded or an error occurred. For details on error handling, see chapter 4.5 on page 26.

Both functions are thread-safe and can thus be used by more than one thread at a time.

Note: tw_erase_db does not remove the file an SQLite database is stored in.

4.4.5 tw_free

    void tw_free(tw_t *tw);

This function closes an open connection to a database and frees all resources used by a Textweiser object.

The function takes a pointer to an initialized Textweiser object (tw_t) as an argument, see chapter 4.3.1 on page 16.

The function is thread-safe and can thus be used by more than one thread at a time.

Note: This function has to be used on any supported operating system to free allocated memory and close the database connection.

4.4.6 tw_free_categories

    void tw_free_categories(char **cats);

This function frees the memory used by a catego
4. database.

4.3.2 Classification Result tw_prob_t

The classification results of the functions tw_classify, tw_classify_v2, tw_classify_file and tw_classify_file_v2 are given as an array of pointers to tw_prob_t data structures and stored to a user-definable memory location. The end of the array is marked with a NULL element.

Each tw_prob_t data structure contains the name of a category and the probability that the document belongs to this category. The elements of the array are sorted by probability in descending order.

The formal definition of the data structure is as follows:

    typedef struct {
        char *category;
        float probability;
    } tw_prob_t;

4.3.3 Classification Options tw_classify_opt_t

The functions tw_classify_v2 and tw_classify_file_v2 accept options that influence classification. Both functions behave like their non-v2 counterparts if TW_CLASSIFY_DEFAULT is passed as the option.

The data structure comprises the following constants (1):

    Constant              Description                                            Default
    TW_CLASSIFY_DEFAULT   Use defaults
    TW_CLASSIFY_PABS      Use absolute probabilities                             yes
    TW_CLASSIFY_PDIST     Use distributed probabilities                          no
    TW_CLASSIFY_SPARF     On equal probability, sort parent categories first     yes
    TW_CLASSIFY_SSUBF     On equal probability, sort subcategories first

Figure 11: Constants of the tw_classify_opt_t Data Structure

(1) Mutually exclusive options are presented in the same color.

For a detailed expl
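The NULL-terminated result array described in chapter 4.3.2 can be walked with a simple loop. The following is a minimal sketch and not part of the SDK: it redeclares tw_prob_t locally, following the definition given in that chapter, so that it compiles without tw.h, and the helper names count_results and print_results are hypothetical. The scale of the probability value is not stated in this excerpt, so the value is printed unchanged.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Local copy of the structure from chapter 4.3.2.
 * Real code would include tw.h instead of redeclaring it. */
typedef struct {
    char *category;
    float probability;
} tw_prob_t;

/* Hypothetical helper: number of entries before the terminating NULL. */
static size_t count_results(tw_prob_t *const *probs)
{
    size_t n = 0;

    assert(probs != NULL);
    while (probs[n] != NULL)
        n++;
    return n;
}

/* Hypothetical helper: print every result and return the first entry,
 * which is the most probable one because the array is sorted in
 * descending order. Returns NULL for an empty result list. */
static const tw_prob_t *print_results(tw_prob_t *const *probs, FILE *out)
{
    size_t i;

    assert(probs != NULL && out != NULL);
    for (i = 0; probs[i] != NULL; i++)
        fprintf(out, "%02u %s => %.2f\n", (unsigned)(i + 1),
                probs[i]->category, probs[i]->probability);
    return probs[0];
}
```

An application that classifies automatically would use only the returned first entry, while one that offers suggestions to the end user would show the whole list.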
5.
    EXIT_SUCCESS;
    } else {
        fprintf(stderr, "Failed to classify: %s\n", tw_strerror(rv));
        return EXIT_FAILURE;
    }

    return EXIT_SUCCESS;
    }

The following output shows an example execution of the application:

    Category Economy Markets => 100.00
    Category Holidays => 13.02

C References

> Lingua Systems Textweiser SDK product website: http://www.lingua-systems.com/text-classification/
> Textweiser SDK software specification for version 1.3.0: http://www.lingua-systems.com/text-classification/
> The Unicode Standard: http://www.unicode.org/
> RFC 2279, UTF-8, a transformation format of ISO 10646: http://www.ietf.org/rfc/rfc2279.txt
> SQLite: http://www.sqlite.org/
> Microsoft SQL Server: http://www.microsoft.com/sqlserver/
> MSDN, Installing SQL Server Native Client: http://msdn.microsoft.com/en-us/library/ms131321.aspx
> MSDN, Encrypting Connections to SQL Server: http://msdn.microsoft.com/en-us/library/ms189067.aspx
> MSDN, Using Encryption Without Validation: http://msdn.microsoft.com/en-us/library/ms131691.aspx
6. It has to meet the following conditions:

1. The category depth may not change.
2. The relation to the direct top-level category must stay the same.

Error code TW_ECONSTR indicates that one of these conditions is violated.

The function can only be used to rename a category and is not suitable for moving a category within a mono-hierarchical structure. If you want to move a category and change the structure, delete the category, add it at its new position and train it again.

4.4.15 tw_strerror

    const char *tw_strerror(tw_errno_t errnum);

The function takes an error indicator (tw_errno_t) as an argument and returns a pointer to a read-only string (const char *) containing the English error message. A list of all error codes and their descriptions is given in chapter 4.5.1 on page 27.

The memory of the returned string must not be freed.

The function is thread-safe and can thus be used by more than one thread at a time.

4.4.16 tw_unlearn and tw_unlearn_file

    tw_errno_t tw_unlearn(tw_t *tw, const char *cat, const char *str);
    tw_errno_t tw_unlearn_file(tw_t *tw, const char *cat, const char *path);

These functions analyse an input document and undo a previous learning operation.

The functions take a pointer to an initialized Textweiser object (tw_t) as the first argument, see chapter 4.3.1 on page 16. The second parameter, cat, denotes the category that was erroneously trained before. The third argument is the document, provided as a string t
7. doc, examples, example_add_learn.c, example.cfg, example_classify.c, example_get_categories.c, example_init.c, example_parse_config.c, include, tw.h, libtw.so, libtw.so.10, libtw.so.1.1.0

2.3 Installing the Software

Textweiser SDK is provided as a compressed archive, either in Zip or tar.gz form depending on the target platform. To install the software, unpack the archive to a directory of your choice and add the library and header files to your project.

2.4 Deinstalling the Software

To deinstall the software, remove the directory you unpacked Textweiser SDK to.

3 Hints on the Usage of Textweiser

Before putting a text classifier into operation, the deployment has to be planned. Once the resulting structure of categories has been planned, you can start preparing the text classifier. The structure can nevertheless still be changed during operation: Textweiser allows you to add new categories and to rename or delete existing ones.

One of the most important factors for the accuracy of the classification results is the training of the classifier. During training, the system learns the characteristics of representative documents for each category. It is recommended to choose at least ten documents per category.

When training is complete, the software can be used to classify unknown documents. Textweiser provides a list of categories a document may belong to along
8. functions are thread-safe and can thus be used by more than one thread at a time.

4.4.12 tw_optimize_db

    tw_errno_t tw_optimize_db(tw_t *tw);

This function optimizes a Textweiser database with regard to performance and accuracy.

The function's argument is a pointer to an initialized Textweiser object (tw_t), see chapter 4.3.1 on page 16.

The function returns an error code that indicates whether the function succeeded or an error occurred. For details on error handling, see chapter 4.5 on page 26.

The function is thread-safe and can thus be used by more than one thread at a time.

Note: This function should be invoked when training with a set of documents has been completed. It has to be called whenever the structure of the system has changed, for example when a category was deleted. tw_optimize_db updates the database so that performance and accuracy increase.

4.4.13 tw_parse_config

    tw_errno_t tw_parse_config(const char *path, tw_config_t *config);

The function parses a configuration file and stores its content in a data structure. For the usage of a configuration file, please refer to chapter 4.2 on page 15. Any value that is not given is set to NULL or 0, respectively.

The first argument, path, is the path to the configuration file. The second argument, config, is a pointer to a tw_config_t data structure as described in chapter 4.3.4 on page 16. Every variable of type tw_config_t should be initialized using the TW_CONFIG_INITIALIZER m
9. self-signed certificates.

The variables and options are described again in later chapters of this manual: on page 15 (configuration file), page 17 (variables and flags) and page 29 (options of the commandline applications).

For further information on the configuration of the database server, please refer to your Microsoft SQL Server documentation and the relevant articles on MSDN listed in appendix C on page 40.

4 Application Programming Interface

4.1 Overview

Textweiser SDK's C/C++ library provides an API that is intuitive to use and easy to integrate into applications. All functions and data structures are prefixed with tw_ to avoid confusion and collisions with third-party library functions, and are defined in the header file tw.h.

Input passed to the library is expected to be plain text encoded in UTF-8. It is recommended to use Textweiser only with supported languages (see the software specification). Other languages can be processed nevertheless, but important linguistic preprocessing steps will be missing, so less accurate results are likely.

The functions can be divided into five categories: administration, resource handling, learning, classification and auxiliaries.

4.1.1 Functions for Administration

Administration: tw_create_db, Database Parameters, tw_erase_db, tw_optimize_db, tw_add_category, tw_d
11. with their probabilities. The number of results can be defined by the application that uses the Textweiser library. This way, the library may be used to classify a document automatically (by choosing one result) or to provide a list of suggestions to the end user.

3.1 Working with Category Structures

Textweiser supports both flat and mono-hierarchical category structures (taxonomies).

3.1.1 Flat Category Structures

Flat category structures cannot express any hierarchical relations. All categories are located on the same level, as the following diagram shows.

Figure 1: Example of a flat Category Structure (top level: Marketing, ...)

A flat structure is easy to plan and implement. It is suitable for systems that have a small or medium number of categories.

3.1.2 Mono-hierarchical Category Structures (Taxonomies)

Relations between categories can be expressed using mono-hierarchical structures (taxonomies). The relations result in a tree structure with a set of top-level and sub-level categories. Each sub-level category may have only one top-level category, but may itself have several sub-level categories.

Figure 2: Example of a mono-hierarchical Structure (Taxonomy). Top level: Marketing; sub level: Invoices, Correspondence, Products, PR, Support, Projects; sub-sub level: Product 1, Product 2, Archival.

To handle mono-hierarchical structures, Text
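Because each sub-level category has exactly one parent, a taxonomy can be held as a flat table with one parent reference per entry. The following is a minimal sketch for illustration only: struct category, NO_PARENT, category_depth and top_level_of are hypothetical names and say nothing about how Textweiser stores categories internally. The table is assumed to be free of cycles.

```c
#include <assert.h>
#include <stddef.h>

#define NO_PARENT (-1)   /* marks a top-level category (hypothetical) */

/* Hypothetical table entry: every category refers to at most one parent. */
struct category {
    const char *name;
    int parent;          /* index of the parent entry, or NO_PARENT */
};

/* Depth of entry idx: 0 for top level, 1 for sub level, and so on. */
static int category_depth(const struct category *cats, int idx)
{
    int depth = 0;

    assert(cats != NULL && idx >= 0);
    while (cats[idx].parent != NO_PARENT) {
        idx = cats[idx].parent;
        depth++;
    }
    return depth;
}

/* Index of the top-level ancestor of entry idx (idx itself if top level). */
static int top_level_of(const struct category *cats, int idx)
{
    assert(cats != NULL && idx >= 0);
    while (cats[idx].parent != NO_PARENT)
        idx = cats[idx].parent;
    return idx;
}
```

With such a table, a flat structure is simply the special case in which every entry has NO_PARENT as its parent.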
12. OK
    Learned 1 document of category Sales

    tw-learn -v -d textweiser.sqlt -c Sales -U projects_3.txt
    Processing projects_3.txt OK
    Optimizing database
    Unlearned 1 document of category Sales

    tw-learn -v -d textweiser.sqlt -c Projects projects_3.txt
    Processing projects_3.txt OK
    Learned 1 document of category Projects

5.5 tw-classify: Classify Unknown Documents

Unknown documents can be classified automatically using tw-classify as soon as the Textweiser database has been initialized with a set of categories and trained using representative documents. During classification, the unknown documents are analysed and their characteristics are compared to those of the trained categories. By default, tw-classify uses a single thread and prints only the most likely category for each document.

Figure 21: tw-classify, Classifying Unknown Documents

tw-classify requires only a set of paths to unknown documents as arguments. The number of threads to use for classification may optionally be set using the -x or --threads option. Increasing the number of threads may increase processing speed, especially on multicore systems. The -n or --show option specifies the number of result categories to be shown along with their determined probabilities. The -r or --distribute option distributes the determined probabilities (see chapt
13. User Manual for Textweiser SDK

A software to classify text

Covers version 1.3.0

Textweiser SDK User Manual, published April 23, 2014

Copyright 2010-2014 Lingua Systems Software GmbH

Lingua Systems Software GmbH
Gerichtsstraße 42
44649 Herne
Germany
info@lingua-systems.com

All rights reserved; in particular, changing or publishing parts of this manual requires prior written permission of the copyright owner. The rights to reproduce and publish unchanged copies in any form, to translate or to present the manual are granted.

Hardware and software mentioned, as well as companies, may be trademarks of their respective owners. Use of a term in this manual should not be regarded as affecting the validity of any trademark or service mark. A missing trademark annotation must not be taken to mean that no trademark is claimed and that the term may be used freely.

Great effort has been made in writing this manual. However, errors cannot be ruled out. The authors and the publisher assume no responsibility and cannot be held liable for any loss or damage caused, or alleged to be caused, directly or indirectly by errors or omissions in this manual. Neither can the authors or the publisher be held liable for the content, or changes of content, of the linked websites. The links were carefully chosen and checked when the manual was prepared.

If you have problems using
14. acro.

The function returns an error code that indicates whether the function succeeded or an error occurred. For details on error handling, see chapter 4.5 on page 26.

The function is thread-safe and can thus be used by more than one thread at a time.

Note: A passed config variable is re-initialized on every call of the function. Any settings previously stored in it will be lost.

4.4.14 tw_rename_category

    tw_errno_t tw_rename_category(tw_t *tw, const char *cur_name, const char *new_name);

This function renames an existing category in a Textweiser database.

The function's first argument is a pointer to an initialized Textweiser object (tw_t), see chapter 4.3.1 on page 16. The second and third arguments are the current category name (cur_name) and the new category name (new_name). Both category names have to be encoded in UTF-8 and must not exceed a length of 255 bytes each.

For hints on using mono-hierarchical category structures, please refer to chapter 3.1.2 on page 8.

The function returns an error code that indicates whether the function succeeded or an error occurred. For details on error handling, see chapter 4.5 on page 26.

The function is thread-safe and can thus be used by more than one thread at a time.

Renaming categories within mono-hierarchical category structures (taxonomies) is only possible if this process does not change the relations between the categories
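The 255-byte limit on category names is a limit in bytes, not characters, which matters for UTF-8 input where one character may occupy several bytes. A minimal sketch of a check an application could run before calling tw_rename_category; the names MAX_CATEGORY_NAME_BYTES and category_name_fits are hypothetical and not part of the SDK, and the manual does not say how the library itself reacts to longer names.

```c
#include <assert.h>
#include <string.h>

/* Limit stated for tw_rename_category; the macro name is hypothetical. */
#define MAX_CATEGORY_NAME_BYTES 255

/* Returns 1 if the NUL-terminated, UTF-8 encoded name fits into the
 * byte limit, 0 otherwise. strlen counts bytes, so a multi-byte
 * character counts with every byte it occupies. */
static int category_name_fits(const char *name)
{
    assert(name != NULL);
    return strlen(name) <= MAX_CATEGORY_NAME_BYTES;
}
```

Checking both cur_name and new_name this way lets an application report an over-long name in its own words before the library is involved.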
15. anation of the provided probability calculations, have a look at chapter 3.2 on page 9.

Whenever a set of classification results shares equal probabilities, the options TW_CLASSIFY_SPARF and TW_CLASSIFY_SSUBF determine the order of the results data structure. TW_CLASSIFY_SPARF sorts any parent category in front of its subcategories, while TW_CLASSIFY_SSUBF provides the opposite sorting, with subcategories preceding their parent categories.

The formal definition of the data structure is as follows:

    typedef enum {
        TW_CLASSIFY_DEFAULT,
        TW_CLASSIFY_PABS,
        TW_CLASSIFY_PDIST,
        TW_CLASSIFY_SPARF,
        TW_CLASSIFY_SSUBF
    } tw_classify_opt_t;   /* explicit values, if any, are illegible in the source */

4.3.4 Configuration Data tw_config_t

Any variable of type tw_config_t should be initialized on declaration using the macro TW_CONFIG_INITIALIZER.

A configuration file can be used to provide all database settings. The data structure tw_config_t is used by the function tw_parse_config to store all settings parsed from the configuration file and make them accessible to the application.

The database settings can also be assigned manually. Examples of assigning the values can be found in the example applications in appendices A and B. Whenever settings have been assigned manually, tw_free_config_t must not be used.

The data structure tw_config_t stores the following settings: name of the database server (host), user name (user), password (passwd), name of the databa
16. and utilize verbose processing mode.

First, a new database is created and a few categories are added, one of them containing a typing error. A category listing is requested afterwards.

    tw-admin -v -d textweiser.sqlt -C
    Creating Textweiser tables in textweiser.sqlt
    tw-admin -v -d textweiser.sqlt -A -c Sales
    Adding category Sales
    tw-admin -v -d textweiser.sqlt -A -c Projcets
    Adding category Projcets
    tw-admin -v -d textweiser.sqlt -L
    Categories in textweiser.sqlt
    01 Projcets
    02 Sales

The typing error in the category name Projcets is now fixed by renaming the category.

    tw-admin -v -d textweiser.sqlt -R -c Projcets -n Projects
    Renaming category Projcets to Projects
    tw-admin -v -d textweiser.sqlt -L
    Categories in textweiser.sqlt
    01 Projects
    02 Sales

5.4 tw-learn: Learn Category Characteristics

tw-learn determines category characteristics using a set of representative documents. Similar documents can then be classified automatically. If a document has been learned erroneously as an example of a category, tw-learn is able to unlearn its characteristics by updating the learned associations and optimizing the database afterwards.

Figure 20: tw-learn, Learning of Category Characteristics (Learn Document(s), Unlearn Document(s))

In order to instruct tw-learn to determine and learn characteristics, pass the paths to the representative documents. The categor
17. ated and initialized using the backup file created before.

    tw-admin -v -d restored.sqlt -C
    Creating Textweiser tables in restored.sqlt
    tw-backup -v -d restored.sqlt -R -i example.bup
    Restoring backup from example.bup to restored.sqlt
    tw-admin -v -d restored.sqlt -L
    Categories in restored.sqlt
    01 Projects
    02 Sales

A Example Application: add_learn.c

    #include <stdio.h>
    #include <stdlib.h>
    #include <tw.h>

    struct cat {
        const char *name;
        const char *text;
    };

    struct cat cats[] = {
        { "Cinema",  "Several new films start this weekend." },
        { "Weather", "Today it is a bit cloudy." }
    };

    int main(int argc, char **argv)
    {
        tw_errno_t  rv  = TW_OK;
        tw_config_t cfg = TW_CONFIG_INITIALIZER;
        tw_t        tw  = TW_INITIALIZER;
        short       i   = 0;

        /* Initialize a Textweiser object using the SQLite database backend */
        cfg.db_name = "example.sqlt";
        rv = tw_init(&tw, &cfg);   /* this line is garbled in the source; arguments assumed */

        if (rv != TW_OK) {
            tw_free(&tw);
            fprintf(stderr, "Failed to initialize: %s\n", tw_strerror(rv));
            return EXIT_FAILURE;
        }

        for (i = 0; i < sizeof(cats) / sizeof(struct cat); i++) {
            printf("Adding category %s\n", cats[i].name);
            rv = tw_add_category(&tw, cats[i].name);

            if (rv != TW_OK) {
                tw_free(&tw);
                fprintf(stderr, "Failed to add category: %s\n", tw_strerror(rv));
                return EXIT_FAILURE;
            }

            printf("Learning text
19. dicator to implement adequate error handling. The return value TW_OK indicates that the function was successful.

Figure 13: Flowchart of Textweiser's Error Handling (a Textweiser function returns an error code; the code is evaluated by the application's error handling, and when passed to tw_strerror it returns an error message)

4.5.1 tw_errno_t Named Error Constants

Textweiser uses the data structure tw_errno_t to provide named error constants for all error cases. Any error code may be passed to tw_strerror to obtain an English error message describing the error (see chapter 4.4.15, page 24).

The following table lists all named error constants used in Textweiser version 1.3.0, accompanied by the error messages returned when they are passed to tw_strerror.

    Constant       Error Message
    TW_OK          No error
    TW_ENOMEM      Failed to allocate memory
    TW_EARG        Invalid argument
    TW_ESHORT      Insufficient input length
    TW_EPREPROC    Failed to preprocess text
    TW_ENOINIT     Object not initialized
    TW_EIO         File input/output error
    TW_EFOPEN      Failed to open file
    TW_ECFG        Failed to parse configuration file
    TW_ECAT        Invalid category
    TW_ENOSUTF     Not a supported Unicode Transformation Format
    TW_ERLOCK      Failed to lock resource
    TW_ECONSTR     Constraint violated
    TW_EBFMT       Invalid backup file format
    TW_EBINV       Invalid backup data
    TW_EDBPERM     Database denied permission
    TW_EDBIO       Database input/output e
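The check-then-report pattern implied by the error constants can be sketched as follows. This is an illustration under stated assumptions, not SDK code: the enum is a stand-in that copies only four constant names from the table, their numeric values are assumed, demo_strerror imitates tw_strerror for those four codes, and check is a hypothetical helper. Real code would include tw.h and call tw_strerror directly.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Stand-in for tw_errno_t: names taken from the table in chapter 4.5.1,
 * numeric values assumed. Real code would include tw.h instead. */
typedef enum { TW_OK = 0, TW_ENOMEM, TW_EARG, TW_ESHORT } tw_errno_t;

/* Stand-in for tw_strerror, covering only the codes declared above. */
static const char *demo_strerror(tw_errno_t e)
{
    switch (e) {
    case TW_OK:     return "No error";
    case TW_ENOMEM: return "Failed to allocate memory";
    case TW_EARG:   return "Invalid argument";
    case TW_ESHORT: return "Insufficient input length";
    }
    return "Unknown error";
}

/* Hypothetical helper: returns 1 if rv signals success; otherwise writes
 * "<what>: <message>" to err and returns 0. */
static int check(tw_errno_t rv, const char *what, FILE *err)
{
    assert(what != NULL && err != NULL);
    if (rv == TW_OK)
        return 1;
    fprintf(err, "%s: %s\n", what, demo_strerror(rv));
    return 0;
}
```

Routing every return value through one such helper keeps the comparison against TW_OK and the message lookup in a single place.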
20. ed or an error occurred. For details on error handling, see chapter 4.5 on page 26.

When the Textweiser object is no longer needed, its memory should be freed with tw_free, see chapter 4.4.5 on page 21.

The function is thread-safe and can thus be used by more than one thread at a time.

4.4.11 tw_learn and tw_learn_file

    tw_errno_t tw_learn(tw_t *tw, const char *cat, const char *str);
    tw_errno_t tw_learn_file(tw_t *tw, const char *cat, const char *path);

These functions analyse an input document and store its characteristics in a category's profile.

The functions take a pointer to an initialized Textweiser object (tw_t) as the first argument, see chapter 4.3.1 on page 16. The second parameter, cat, denotes the category to train. The third argument is the document that serves as an example of the category. The document can be provided as a string (tw_learn) or as a path to a file (tw_learn_file). The document has to be encoded in UTF-8.

For hints on using mono-hierarchical category structures, please refer to chapter 3.1.2 on page 8.

Note: At least ten documents should be learned for each category. Please take care that the documents are representative of this category and differ from each other.

The functions return an error code that indicates whether the respective function succeeded or an error occurred. For details on error handling, see chapter 4.5 on page 26.

Both
21. Figure 5: Flowchart of the Functions for Administration

New Textweiser databases can be created with tw_create_db. When training of categories is completed, the database can be optimized for performance and accuracy using tw_optimize_db. Categories can be added, deleted and renamed using tw_add_category, tw_delete_category and tw_rename_category.

Additional functions to maintain the database are tw_backup_db and tw_restore_db, which create backups of the Textweiser database and restore the data if necessary. tw_erase_db deletes all Textweiser data from a database.

4.1.2 Functions for Resource Handling

Figure 6: Flowchart of the Functions for Resource Handling

tw_init initializes a new Textweiser object and opens a connection to the database. The object and its allocated memory can be freed with tw_free when it is no longer needed; this function closes the database connection as well. The allocated memory used for classification results stored in tw_prob_t can be freed with tw_free_pr
22. er 3.2 on page 9 for an explanation).

Whenever a set of classification results shares equal probabilities, parent categories precede their subcategories. The option -b or --sub-first changes this sorting behaviour and places subcategories in front of their parent categories.

    Short  Long          Parameter  Description
    -x     --threads     Number     Use the given number of threads
    -n     --show        Number     Show at most Number results
    -r     --distribute             Use distributed probabilities
    -b     --sub-first              On equal probabilities: subcategories first

Figure 22: tw-classify, Classification Options

The available parameters used to connect to the database are described in chapter 5.1 on page 29.

5.5.1 Usage Example

The following examples assume that the SQLite version of Textweiser is used.

The examples show how tw-classify classifies four documents using two threads: once using the default output settings, and once using verbose processing mode combined with a user-defined number of results to show.

    tw-classify -d textweiser.sqlt -x 2 text_1.txt text_2.txt text_3.txt text_4.txt
    text_1.txt Sales
    text_2.txt Sales
    text_3.txt Projects
    text_4.txt Projects

    tw-classify -v -d textweiser.sqlt -x 2 -n 5 text_1.txt text_2.txt text_3.txt text_4.txt
    Classification results for text_1.txt
    01 Sales => 100.00%
    02 Projects => 41.25%
    C
23. gory.

4.4.2 tw_backup_db and tw_restore_db

    tw_errno_t tw_backup_db(tw_t *tw, const char *out_path);
    tw_errno_t tw_restore_db(tw_t *tw, const char *in_path);

tw_backup_db generates a backup of a Textweiser database and stores it in a file. tw_restore_db restores a database from such a backup file.

Both functions take a pointer to an initialized Textweiser object (tw_t) as the first argument, see chapter 4.3.1 on page 16. As the second argument, tw_backup_db expects a path to the file the backup should be stored in, while tw_restore_db expects a path to a previously created backup file.

When tw_restore_db is used, any existing data in the database is replaced by the data of the backup file.

The functions return an error code that indicates whether the respective function succeeded or an error occurred. For details on error handling, see chapter 4.5 on page 26.

Both functions are thread-safe and can thus be used by more than one thread at a time.

Note: tw_backup_db overwrites a backup file if it already exists.

4.4.3 tw_classify, tw_classify_v2, tw_classify_file and tw_classify_file_v2

    tw_errno_t tw_classify(tw_t *tw, const char *str, short n, tw_prob_t ***probs);
    tw_errno_t tw_classify_v2(tw_t *tw, const char *str, short n, tw_classify_opt_t opt, tw_prob_t ***probs);
    tw_errno_t tw_classify_file(tw_t *tw, const char *path, short n, tw_prob_t ***probs);
    tw_errno
24. (1) Microsoft SQL Server

Textweiser does not implement encryption itself but relies on the database driver in use for this task. However, on request Textweiser ensures that the driver is configured to use encrypted connections only.

Further information on how to configure Textweiser to use encrypted connections can be found in chapter 4.3.4 on page 17 and, concerning the commandline applications, in chapter 5.1 on page 29.

3.4.1 Microsoft SQL Server

Microsoft SQL Server provides SSL-secured connections using certificates. By default, a certificate provided by the database server is validated; if Textweiser has been configured to use encryption and the certificate fails to validate, the connection is rejected.

Whenever no certificate has been assigned to the server, Microsoft SQL Server generates a self-signed certificate that may be used for encryption without validation. If you intend to use this certificate, Textweiser has to be configured to instruct the database driver to trust the server certificate without validation.

To accomplish this, the member encrypt of the tw_config_t variable in use has to be set to both TW_ENCRYPT_ON and TW_ENCRYPT_TRUST_CERT. When using the commandline applications, use both the encrypt and trust cert options. In a Textweiser configuration file, setting the key encrypt to a value of on, trust cert is sufficient to allow encryption using
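Setting the encrypt member "to both" constants presumably means combining them as bit flags with a bitwise OR; that reading is an assumption, since neither the type of the member nor the values of the constants are given here. The sketch below therefore uses stand-ins throughout: struct demo_config and the DEMO_ENCRYPT_* values are hypothetical and only mirror the names TW_ENCRYPT_ON and TW_ENCRYPT_TRUST_CERT.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in flag values. The real constants TW_ENCRYPT_ON and
 * TW_ENCRYPT_TRUST_CERT come from tw.h; their values are assumed here. */
enum {
    DEMO_ENCRYPT_OFF        = 0,
    DEMO_ENCRYPT_ON         = 1 << 0,
    DEMO_ENCRYPT_TRUST_CERT = 1 << 1
};

/* Stand-in for the relevant part of tw_config_t (member type assumed). */
struct demo_config {
    unsigned encrypt;
};

/* Request encryption and trust the server certificate without validation. */
static void enable_encryption_without_validation(struct demo_config *cfg)
{
    assert(cfg != NULL);
    cfg->encrypt = DEMO_ENCRYPT_ON | DEMO_ENCRYPT_TRUST_CERT;
}

/* Returns 1 only if both flags are set: trusting the certificate is
 * meaningful only in addition to enabled encryption. */
static int trusts_server_certificate(const struct demo_config *cfg)
{
    assert(cfg != NULL);
    return (cfg->encrypt & DEMO_ENCRYPT_ON) &&
           (cfg->encrypt & DEMO_ENCRYPT_TRUST_CERT);
}
```

If the real member turns out not to be a bit field, only the assignment changes; the rule that trust cert never stands alone remains the same.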
25. lassification results for text_2.txt
    01 Sales => 100.00
    Classification results for text_3.txt
    01 Projects => 100.00
    02 Sales => 16.38
    Classification results for text_4.txt
    01 Projects => 100.00

5.6 tw-backup: Backup and Restore the Database

tw-backup is used to create and restore Textweiser database backups. When restoring from a backup, all existing data records of the selected Textweiser database are erased and replaced by those of the backup.

Figure 23: tw-backup, Textweiser Backup (Backup DB, Restore DB)

The modes can be activated using the options -B or --backup and -R or --restore, respectively. It is mandatory to specify a backup file as well: -o or --output sets the output file in backup mode, while -i or --input expects a path to a previously created backup file as an argument.

The available parameters used to connect to the database are described in chapter 5.1 on page 29.

5.6.1 Usage Example

The following examples assume that the SQLite version of Textweiser is used, and utilize verbose processing mode.

All categories known to the current database are displayed, and a backup is created afterwards.

    tw-admin -v -d example.sqlt -L
    Categories in example.sqlt
    01 Projects
    02 Sales
    tw-backup -v -d example.sqlt -B -o example.bup
    Storing backup of example.sqlt to example.bup

A new database is cre
26. ns. These applications allow users to classify text and administrators to maintain the system. The applications may also be used in scripts for automation.

Any input passed to Textweiser has to be plain text and should be encoded in UTF-8. The text data is preprocessed in a language-dependent way to optimize the results. It is therefore recommended to use Textweiser only with data in supported languages. Other languages can be processed as well, but results are likely to be less precise. A list of supported languages is provided in the software specification.

2 Installation

2.1 Requirements

Textweiser requires the system's standard C and thread libraries. Additional requirements depend on the database used. With SQLite, no further dependencies occur because this database software is already included. With Microsoft SQL Server, Textweiser additionally requires both the standard ODBC library (odbc32.dll) and the SQL Server Native Client 10.0 library to be installed. Hints on installation are available on the Microsoft Developer Network (MSDN), see appendix C on page 40.

2.2 What Will Be Installed

The Textweiser SDK contains a dynamic library (DLL/SO), its header file, the code of an example application and this manual. The Software Development Kit for Linux contains the following files:

[Directory tree lost in extraction. Recoverable entries: tw admin, tw backup, tw classify, tw learn, examples, LICENSE, manual sdk eng pdf]
27. ob_t. Accordingly, tw_free_config_t frees the memory used by a tw_config_t data structure, and tw_free_categories frees the memory used by a category listing.

4.1.3 Functions for Learning

Figure 7: Flowchart of the Functions for Learning

In order to assign unknown documents to a category, Textweiser has to learn the characteristics of each category with the help of representative documents. Use tw_learn or tw_learn_file to train Textweiser. If a document was learned by mistake, use tw_unlearn or tw_unlearn_file to undo the training.

4.1.4 Functions for Classification

Figure 8: Flowchart of the Functions for Classification

As soon as all categories have been trained, Textweiser can classify documents. The functions tw_classify, tw_classify_v2, tw_classify_file and tw_classify_file_v2 assign unknown text to categories.

4.1.5 Auxiliary Functions

[Figure panel: tw_strerror (error code to error description), tw_get_categories (category list), tw_version (version number), tw_version_string (version string), tw_parse_config (configuration to tw_config_t with host, user, password, db_name, port)]

Figure 9 Flo
28. ook at the documentation of the application programming interface (chapter 4 on page 12). Administrators who want to install the software can obtain all necessary information from chapter 2 (page 7).

1 Introduction

Textweiser is a text classifier that assigns unknown text documents to categories. The software's administrator first prepares the software for use: all categories have to be added to the system. After the categories have been added, each of them must be trained with a set of representative documents, at least ten per category. Textweiser analyses the documents and extracts the information needed to classify unknown documents afterwards. This information is stored in a database. Once training is complete, Textweiser can classify unknown documents and the system is ready for use.

Textweiser can, for example, be used to suggest categories, automatically route emails or, more generally, in the field of document management.

The software can handle both flat and mono-hierarchical category structures (taxonomies). The handling of hierarchies is fully supported, see chapter 3.1 on page 8.

An intuitive interface to the library allows you to integrate Textweiser easily. The C/C++ library is thread-safe and provides access to all functions needed to make use of a text classifier within your own application. Additionally, Textweiser comes with a set of commandline applicatio
29. probability sums up to the absolute probability of the most probable category, which does not necessarily need to be 100.

    Category              Absolute   Distributed
    1 Technology            100.00         32.14
    2 Economy                69.66         22.39
    3 Economy Markets        61.61         19.80
    4 Sport                  44.69         14.36
    5 Sport Football         35.17         11.30

Figure 3: Example of different probability calculations

3.3 Common Workflow

A typical workflow includes the following steps:

    1. Create a database
    2. Add categories
    3. Learn documents
    4. Optimize database
    5. Classify

Optimizing the database increases performance and accuracy. It is recommended to run an optimization whenever you have learned a set of documents, deleted a category or unlearned a document.

Figure 4: Common Workflow (Add Categories, Learn Each Category's Documents, Textweiser Ready, Classify Unknown Documents)

Additionally, functions are provided to maintain the database:

    Generate a backup
    Restore data from a backup
    Delete data from a database

3.4 Encryption of the Database Connection

Depending on the database used, Textweiser may provide the option to encrypt the connection to the database. This way, data is transmitted over the network securely. Textweiser supports encryption if one of the following databases is used:
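The manual does not state the formula behind the distributed calculation. The values in Figure 3 are, however, consistent with rescaling each absolute probability by the sum of all absolute probabilities and multiplying by the top category's absolute probability. The following sketch illustrates that reading. The function name distribute is hypothetical and not part of the Textweiser API, and the formula is an assumption derived from the figure only.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of the Textweiser API.
 * Assumption inferred from Figure 3: each distributed value equals
 * absolute[i] / sum(absolute) * max(absolute), so that the distributed
 * values sum to the absolute probability of the most probable category. */
void distribute(const double *absolute, double *distributed, size_t n)
{
    double sum = 0.0;
    double top = 0.0;
    size_t i;

    assert(absolute != NULL && distributed != NULL);

    /* First pass: total of all absolute values and the largest one. */
    for (i = 0; i < n; i++) {
        sum += absolute[i];
        if (absolute[i] > top)
            top = absolute[i];
    }

    /* Second pass: rescale; an all-zero input yields all zeros. */
    for (i = 0; i < n; i++)
        distributed[i] = (sum > 0.0) ? absolute[i] / sum * top : 0.0;
}
```

Applied to the absolute column of Figure 3 (100.00, 69.66, 61.61, 44.69, 35.17), this reproduces the distributed column to two decimal places.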
30. ration

To use a specific functionality provided by tw admin, the corresponding mode has to be activated by passing an option. Operations on categories require the name of the respective category. It has to be given as an argument to the c or cat option in order to add, delete or rename a category. In the latter case, the new category name is expected as an argument to the n or cat_new option. All category names have to be UTF-8 encoded and are restricted to a maximum length of 255 bytes.

tw admin handles both flat and mono-hierarchical category structures (taxonomies). If taxonomies are used, any renaming operation is subject to two restrictions: the new category name has to be of the same category depth and has to keep the same direct top-level category. For further details on how to use hierarchies, see chapter 3.1 on page 8.

The available parameters used to connect to the database are described in chapter 5.1 on page 29.

    Short   Long Option   Description of Mode
    C       create        Create a new database
    A       add cat       Add a new category
    D       del cat       Delete an existing category
    R       ren cat       Rename an existing category
    L       list          List all categories
    O       optimize      Optimize all data records
    E       erase         Erase all data records

Figure 19: tw admin Options and Modes

5.3.1 Usage Example

The following examples assume the SQLite version of Textweiser is used
31. rror
    TW_EDBFULL   Database full
    TW_EDBAUTH   Database authorization failed
    TW_EDBCON    Failed to connect to database
    TW_EDB       Internal database error
    TW_EINT      Internal error

Figure 14: tw_errno_t Named Constants and Error Messages

4.6 Hints on Application Development

4.6.1 Determining Textweiser's Version

After including the tw.h header, the following preprocessor definitions are available at compile time:

    Definition           Value
    TW_VERSION_MAJOR     1
    TW_VERSION_MINOR     3
    TW_VERSION_BUGFIX    0
    TW_VERSION_STRING    1.3.0

Figure 15: Version Information at Compile Time

To determine Textweiser's version at runtime, use tw_version or tw_version_string, see chapter 4.4.18, page 25.

5 Commandline Interface

Textweiser includes four applications that make all essential functionality available on the commandline: tw admin 1, tw learn 1, tw classify 1 and tw backup 1. These applications can also be used in scripts, for example to automate common administration tasks such as optimizing the database periodically. Every Textweiser application provides a short help on its usage if invoked with the h parameter.

The first section introduces the parameters required to establish a connection to the database. A detailed overview and usage examples of the applications are given afterwards.

5.1 Connecting to the Databa
33. rs Version
    5 Commandline Interface
    5.1 Connecting to the Database
    5.2 Common Options
    5.3 tw admin Textweiser Administration
    5.3.1 Usage Examples
    5.4 tw learn Learn Category Characteristics
    5.4.1 Usage Example
    5.5 tw classify Classify Unknown Documents
    5.5.1 Usage Example
    5.6 tw backup Backup and Restore the Database
    5.6.1 Usage Example
    A Example Application add learn c
    B Example Application classify.c
    C References

About this Manual

This manual addresses users with experience in C/C++ programming and at least a basic knowledge of library usage, as well as users of the commandline applications.

The manual provides a short introduction to the library and the applications, followed by instructions on how to install the Textweiser software package. Afterwards, some hints on the usage of a text classifier are given before the complete interface (API) is introduced along with the possibilities of error handling. Finally, the commandline applications are introduced, including usage examples.

For a quickstart, have a l
34. ry list pointed to by cats that was generated by tw_get_categories. The function is thread-safe and thus can be used by more than one thread at a time.

On Windows, this function is obligatory to free the allocated memory.

4.4.7 tw_free_config_t

    void tw_free_config_t(tw_config_t *config);

This function frees the memory allocated by a tw_config_t data structure that has been generated by tw_parse_config. It expects a pointer to a tw_config_t data structure. The function is thread-safe and thus can be used by more than one thread at a time.

This function must not be used if the tw_config_t data structure has been initialized or modified manually. On Windows, this function is obligatory to free the allocated memory.

4.4.8 tw_free_prob_t

    void tw_free_prob_t(tw_prob_t **probs);

This function frees the memory allocated by a list of tw_prob_t data structures. It expects a pointer to a list of tw_prob_t data structures as generated by tw_classify and tw_classify_file. The function is thread-safe and thus can be used by more than one thread at a time.

On Windows, this function is obligatory to free the allocated memory.

4.4.9 tw_get_categories

    tw_errno_t tw_get_categories(tw_t *tw, char ***list);

This function generates an array of all categories in a Textweiser database and stores it to list. The function takes a pointer to an initialized Text
35. s\n", cats[i].text);

        rv = tw_learn(&tw, cats[i].name, cats[i].text);
        if (rv != TW_OK) {
            tw_free(&tw);
            fprintf(stderr, "Failed to learn text: %s\n", tw_strerror(rv));
            return EXIT_FAILURE;
        }
    }

    tw_free(&tw);

    return EXIT_SUCCESS;
}

The following output shows an example execution of the application:

    Adding category Cinema
    Learning text Several new films start this weekend
    Adding category Weather
    Learning text Today it is a bit cloudy

B Example Application classify.c

    #include <stdio.h>
    #include <stdlib.h>
    #include <tw.h>

    int main(int argc, char **argv)
    {
        tw_errno_t rv = TW_OK;
        tw_config_t cfg = TW_CONFIG_INITIALIZER;
        tw_prob_t **probs = NULL;
        const char *string = "The house prices have risen";
        tw_t tw = TW_INITIALIZER;

        /* Initialize a Textweiser object using the SQLite database backend */
        cfg.db_name = "example.sqlt";

        rv = tw_init(&tw, &cfg);
        if (rv != TW_OK) {
            fprintf(stderr, "Failed to initialize: %s\n", tw_strerror(rv));
            return EXIT_FAILURE;
        }

        /* Request at most two results, then release the Textweiser object */
        rv = tw_classify(&tw, string, 2, &probs);

        tw_free(&tw);

        if (rv == TW_OK) {
            if (probs) {
                short i = 0;

                /* The result array is terminated by a NULL element */
                for (i = 0; probs[i]; i++) {
                    printf("Category: %s (%f)\n",
                           probs[i]->category, probs[i]->probability);
                }

                tw_free_prob_t(probs);
            } else {
                puts("No results");
            }
        }

        return
36. se

Every application included in the Textweiser software distribution needs to access the database. The required connection settings may either be passed to the application directly on the commandline or be stored in a configuration file. All applications accept the following commandline parameters:

    Short   Long Option   Parameter                     Type     Example
    d       db_name       Database name                 String   tw db
    S       host          Name of the database server   String   localhost
    u       user          Username                      String   doe
    W       passwd        Password                      String   secret
    p       port          Port of the database server   Number   1433
    t       instance      SQL Server instance           String   SQLEXPRESS
    e       encrypt
            trust cert

Figure 16: Parameters used to connect to the Database

The instance option is only available in the SQL Server version of Textweiser and specifies the SQL Server instance that should be used.

The option encrypt enables encryption of the communication with the database if the database supports encryption. If no encrypted connection can be established, the application aborts with an appropriate error message. If you want to trust the server's certificate without validation, pass the trust cert option, which is required in order to use self-signed certificates.

Whenever no port is specified, the application uses the default port of the database software. If no password is given as a parameter, the user can enter the password interactively. For securi
37. se to use, db_name, and the database's port. If Microsoft SQL Server is used as a database, the instance name of the server can be set using instance. Encrypted connections to the database can be configured by setting encrypt to an appropriate value. Textweiser provides three predefined values for this purpose:

    Value                    Description          Comment
    TW_ENCRYPT_OFF           Disable encryption   Default
    TW_ENCRYPT_ON            Enable encryption
    TW_ENCRYPT_TRUST_CERT    Trust certificate    Requires TW_ENCRYPT_ON

Figure 12: Valid Values for encrypt

In order to establish an encrypted connection to the database without certificate validation, for example to use a self-signed certificate, encrypt has to be set to the value that results from either the addition or the bitwise OR of TW_ENCRYPT_ON and TW_ENCRYPT_TRUST_CERT. For further information, refer to chapter 3.4 on page 10.

The formal definition of the data structure is:

    typedef struct {
        char          *host;
        char          *user;
        char          *passwd;
        char          *db_name;
        unsigned int   port;
        char          *instance;
        unsigned char  encrypt;
    } tw_config_t;

The database name db_name has to be encoded in UTF-8.

If the database is SQLite, the parameter db_name denotes the path to the database. The path does not necessarily need to be encoded in UTF-8. All other parameters are ignored and should be set to NULL, and to 0 for port.
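The combination of TW_ENCRYPT_ON and TW_ENCRYPT_TRUST_CERT described above can be sketched as follows. To keep the sketch compilable without the SDK, it declares stand-ins for the structure and the constants: the numeric values of the three constants are placeholders chosen for illustration, not the values from tw.h, and the helper functions configure_self_signed and trusts_unvalidated_cert are hypothetical. Real code would include tw.h instead of the stand-ins.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-ins so the sketch compiles without tw.h. The numeric values are
 * placeholders (assumed to be distinct bits), not the SDK's definitions. */
#define TW_ENCRYPT_OFF        0x00u
#define TW_ENCRYPT_ON         0x01u
#define TW_ENCRYPT_TRUST_CERT 0x02u

/* Mirrors the member list given in the manual's formal definition. */
typedef struct {
    char          *host;
    char          *user;
    char          *passwd;
    char          *db_name;
    unsigned int   port;
    char          *instance;
    unsigned char  encrypt;
} tw_config_t;

/* Hypothetical helper: request an encrypted connection that accepts the
 * server's self-signed certificate without validation. */
void configure_self_signed(tw_config_t *cfg, char *host, char *db_name)
{
    assert(cfg != NULL);
    cfg->host    = host;
    cfg->db_name = db_name;
    cfg->port    = 0;  /* left at 0 in this sketch */
    /* TRUST_CERT requires ON, so both flags are combined with bitwise OR. */
    cfg->encrypt = (unsigned char)(TW_ENCRYPT_ON | TW_ENCRYPT_TRUST_CERT);
}

/* Hypothetical helper: non-zero if both flags are set. */
int trusts_unvalidated_cert(const tw_config_t *cfg)
{
    const unsigned both = TW_ENCRYPT_ON | TW_ENCRYPT_TRUST_CERT;
    return cfg != NULL && (cfg->encrypt & both) == both;
}
```

Using the bitwise OR rather than addition avoids an unintended value if one of the flags is accidentally applied twice.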
38. t tw_classify_file_v2(tw_t *tw, const char *path, short n, tw_classify_opt_t opt, tw_prob_t ***probs);

The functions analyse an input document and calculate how likely it is that the document belongs to each category. A list of categories, sorted in descending order of probability, is stored to probs (see page 16). The user may define the maximum number of results n.

All four functions take a pointer to an initialized Textweiser object tw_t as a first argument, see chapter 4.3.1 on page 16. The second argument is the text to classify, either as a string (tw_classify, tw_classify_v2) or as a file addressed by its path within the file system (tw_classify_file, tw_classify_file_v2). The parameter n defines the maximum number of results to store to probs. The functions tw_classify_v2 and tw_classify_file_v2 additionally accept options in the parameter opt, see 4.3.3 on page 16.

The end of the array that contains the results is marked with a NULL element. Whenever no results could be determined, probs is set to NULL. The data structure tw_prob_t is described in chapter 4.3.2 on page 16.

The text to classify should be encoded in UTF-8.

The functions return an error code that indicates whether the respective function succeeded or an error occurred. For details on error handling, see chapter 4.5 on page 26.

The functions are thread-safe and thus can be used by more than one thread at a time.

4.4.4 tw_create_db and tw_erase_db
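Because the result array is terminated by a NULL element and probs itself may be NULL, code that consumes the results has to handle both cases. The following sketch shows one way to walk such an array. It declares a stand-in for tw_prob_t with the two members used in the example application (category and probability); the member types are assumptions, and the helpers count_results and print_results are hypothetical, not part of the Textweiser API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Stand-in for the SDK type; the member types are assumed here. */
typedef struct {
    char   *category;
    double  probability;
} tw_prob_t;

/* Hypothetical helper: number of entries before the terminating NULL.
 * A NULL array (no results) counts as zero entries. */
size_t count_results(tw_prob_t **probs)
{
    size_t n = 0;

    if (probs == NULL)
        return 0;
    while (probs[n] != NULL)
        n++;
    return n;
}

/* Hypothetical helper: prints one line per result, in array order. */
void print_results(FILE *out, tw_prob_t **probs)
{
    size_t i;

    assert(out != NULL);
    if (probs == NULL) {
        fputs("No results\n", out);
        return;
    }
    for (i = 0; probs[i] != NULL; i++)
        fprintf(out, "%02lu %s %.2f\n", (unsigned long)(i + 1),
                probs[i]->category, probs[i]->probability);
}
```

In an application that links against the library, the array would come from one of the classify functions and would be released with tw_free_prob_t afterwards.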
39. the links or become aware of any faults, feel free to give a brief hint on it via support lingua systems com.

Contents

    1 Introduction
    2 Installation
    2.1 Requirements
    2.2 What Will Be Installed
    2.3 Installing the Software
    2.4 Deinstalling the Software
    3 Hints on the Usage of Textweiser
    3.1 Working with Category Structures
    3.1.1 Flat Category Structures
    3.1.2 Mono-hierarchical Category Structures (Taxonomies)
    3.2 Types of Probability Calculation
    3.3 Common Workflow
    3.4 Encryption of the Database Connection
    3.4.1 Microsoft SQL Server
    4 Application Programming Interface
    4.1
    4.1.1 Functions for Administration
    4.1.2 Functions for Resource Handling
    4.1.3 Functions for Learning
    4.1.4 Functions for Classification
    4.1.5 Auxiliary Functions
    4.2 Configuration File
    4.3 Important Data Structures
    4.3.1 Textweiser Object tw_t
    4.3.2 Classification Result tw_prob_t
40. ty reasons, the entered password is not echoed on the commandline.

All settings may be stored in a configuration file as well. The expected configuration entries consist of simple key value pairs, see chapter 4.2 on page 15. A configuration file can be selected by passing either the f or config option, followed by the path of the file within the file system. If other connection parameters are given directly on the commandline, these override those that may have been set by a configuration file.

If SQLite is used as database software, the parameter of d or db_name denotes the path to the database within the file system. Besides that, no other database connection options are required or available.

5.2 Common Options

Besides the options used to specify how to connect to the database, all Textweiser applications provide the following set of common options:

    Short   Long Option   Description
    v       verbose       Enable verbose output
    V       version       Show version information
    h       help          Show short help

Figure 17: Common Options

5.3 tw admin Textweiser Administration

tw admin provides the possibility to create and administrate Textweiser databases on the commandline. For example, new categories can be added, and existing categories can be deleted or renamed.

Figure 18 tw admin Textweiser Administ
41. w_unlearn or as a file tw_unlearn_file. The document has to be encoded in UTF-8. For hints on using mono-hierarchical category structures, please refer to chapter 3.1.2 on page 8.

The functions return an error code that indicates whether the respective function succeeded or an error occurred. For details on error handling, see chapter 4.5 on page 26.

Both functions are thread-safe and thus can be used by more than one thread at a time.

After unlearning a document, the database should be updated with tw_optimize_db.

4.4.17 tw_version

    int tw_version();

The function does not take an argument and returns a numeric representation of Textweiser's version. The function is thread-safe and thus can be used by more than one thread at a time.

4.4.18 tw_version_string

    const char *tw_version_string();

The function does not take an argument and returns a pointer to a read-only string (const char *) containing Textweiser's version, for example 1.3.0. The memory of the returned string must not be freed. The function is thread-safe and thus can be used by more than one thread at a time.

4.5 Error Handling

Textweiser provides an easy way to handle errors by evaluating the return value. Every function that may fail has an error indicator as a return value. Any application that uses Textweiser should evaluate this error in
42. wchart of the Auxiliary Functions

tw_strerror provides an English error message for error codes used by Textweiser. A list of all categories can be obtained with tw_get_categories. tw_version and tw_version_string provide the library's version at runtime. tw_parse_config reads and evaluates database parameters stored in a configuration file.

4.2 Configuration File

To ease managing the connection to a database, all necessary parameters can be stored in a configuration file. Both the function for parsing and the data structure are described in later chapters. This chapter describes the syntax of the configuration file only.

The configuration file contains simple key value pairs for all parameters:

    host       Hostname of the database server
    user       Username for database authentication
    passwd     Password for database authentication
    db_name    Name of the Textweiser database
    port       Port number of the database server
    instance   Name of the Microsoft SQL Server instance
    encrypt    Configuration whether and how encryption should be used

Each value is associated with a key by assignment (equal sign) and can optionally be written in single or double quotes. Empty lines and whitespace at the start or end of a line are ignored. Lines starting with the comment character are interpreted as comments.

Special attention has to be paid to the key encrypt, which may only be set to one of the following predefined values
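The syntax rules of the configuration file described above (key and value joined by an equal sign, optional single or double quotes, surrounding whitespace ignored, comment and empty lines skipped) can be illustrated with a small line parser. This is only a sketch of the rules, not the implementation behind tw_parse_config. The function names trim and parse_config_line are hypothetical, and the comment character '#' is an assumption because the character is not legible in this copy of the manual.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Strips leading and trailing whitespace in place, returns the new start. */
static char *trim(char *s)
{
    char *end;

    while (isspace((unsigned char)*s))
        s++;
    end = s + strlen(s);
    while (end > s && isspace((unsigned char)end[-1]))
        end--;
    *end = '\0';
    return s;
}

/* Hypothetical helper: parses one "key = value" line in place.
 * Returns 1 and sets *key and *value on success; returns 0 for empty
 * lines, comment lines (assumed to start with '#') and lines without
 * an equal sign or without a key. */
int parse_config_line(char *line, char **key, char **value)
{
    char  *s, *eq, *v;
    size_t len;

    assert(line != NULL && key != NULL && value != NULL);

    s = trim(line);
    if (*s == '\0' || *s == '#')
        return 0;

    eq = strchr(s, '=');
    if (eq == NULL)
        return 0;
    *eq = '\0';

    *key = trim(s);
    v = trim(eq + 1);

    /* Remove one pair of matching single or double quotes, if present. */
    len = strlen(v);
    if (len >= 2 && (v[0] == '"' || v[0] == '\'') && v[len - 1] == v[0]) {
        v[len - 1] = '\0';
        v++;
    }
    *value = v;
    return **key != '\0';
}
```

The parser modifies the line buffer it is given, so callers would pass a writable copy of each line read from the file.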
43. weiser object tw_t as a first argument, see chapter 4.3.1 on page 16. The second argument is a memory location the generated array should be stored to. The end of the generated array is marked with a NULL element.

The function returns an error code that indicates whether the function succeeded or an error occurred. For details on error handling, see chapter 4.5 on page 26.

The function is thread-safe and thus can be used by more than one thread at a time.

If an error occurs or no categories could be found within the database, the value pointed to by list is set to NULL.

4.4.10 tw_init

    tw_errno_t tw_init(tw_t *tw, const tw_config_t *cfg);

This function connects to an existing Textweiser database and initializes a new Textweiser object.

The first argument is a pointer to an uninitialized Textweiser object tw_t, see chapter 4.3.1 on page 16. The object is initialized by this function, so it is ready for use afterwards. As a second parameter, the function expects a pointer to a tw_config_t data structure that contains all settings that are necessary to connect to the database. Details on tw_config_t are given in chapter 4.3.4 on page 17.

You should assign the macro TW_INITIALIZER to any variable of type tw_t on declaration in order to initialize it with its default values before passing it to tw_init along with the settings of the operating environment.

The function returns an error code that indicates whether the function succeed
44. weiser provides an explicit notation for hierarchical relations. The categories are separated by a separator character. For example, the category Archival with its top-level categories Projects and IT is addressed with IT Projects Archival.

When hierarchies are used with this notation, Textweiser automatically organizes the data accordingly:

Add a category: When a sub-level category is added to the system, any top-level categories are added as well if they do not exist yet.

Learn a document: When learning a document for a sub-level category, the data is assigned to all affected top-level categories as well. A document learned for IT Projects Archival is also assigned to IT Projects and IT.

Rename a category: If a top-level category is renamed, all existing sub-level categories are renamed accordingly, so the relations between the documents stay the same.

Delete a category: Deleting a top-level category deletes all its sub-categories as well.

3.2 Types of Probability Calculation

Textweiser provides two types of probability calculation for classification results: absolute or distributed.

The absolute calculation determines the probability that an input document belongs to a category, independent of all other categories within the set of classification results.

The distributed calculation determines the probability taking every other category and their relation within the set of results into account. The distributed
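The rule that a document learned for a sub-level category is also assigned to all of its top-level categories can be pictured as repeatedly stripping the last element of the category path. The sketch below illustrates this; it is not how Textweiser implements hierarchies internally. The function name parent_category is hypothetical, and the separator is passed as a parameter because the separator character used by Textweiser is not legible in this copy of the manual. The '/' used in the usage note is a placeholder.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copies the direct parent of a hierarchical
 * category path into out. Returns 1 if a parent exists and fits into
 * out, 0 if path is already a top-level category or out is too small.
 * sep is the separator character; its real value in Textweiser is not
 * known here, so the caller supplies it. */
int parent_category(const char *path, char sep, char *out, size_t out_size)
{
    const char *last;
    size_t      len;

    assert(path != NULL && out != NULL);

    last = strrchr(path, sep);
    if (last == NULL)
        return 0;                 /* no separator: top-level category */

    len = (size_t)(last - path);
    if (len + 1 > out_size)
        return 0;                 /* parent does not fit into out */

    memcpy(out, path, len);
    out[len] = '\0';
    return 1;
}
```

With '/' as a placeholder separator, calling the function on "IT/Projects/Archival" yields "IT/Projects", calling it again on that result yields "IT", and a third call reports that no parent exists.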
45. y they belong to is specified using the c or cat option. If it is necessary to unlearn a document's characteristics and the resulting associations, the option U or unlearn switches tw learn to its unlearning mode.

The available parameters used to connect to the database are described in chapter 5.1 on page 29.

5.4.1 Usage Example

The following examples assume the SQLite version of Textweiser is used and utilize verbose processing mode. First, tw learn is used to determine and learn the characteristics of the documents per category and associate these with the respective category.

    tw learn v d textweiser.sqlt c Sales sales_1.txt sales_2.txt
    Processing sales_1.txt OK
    Processing sales_2.txt OK
    Learned 2 documents of category Sales
    tw learn v d textweiser.sqlt c Projects projects_1.txt projects_2.txt
    Processing projects_1.txt OK
    Processing projects_2.txt OK
    Learned 2 documents of category Projects

To give an example of unlearning, a document is first learned as an example of the wrong category. The learning process is then reverted and the document is assigned to the correct category. After unlearning a document, the database is automatically optimized to update all data records accordingly. In contrast to using the library directly, this operation does not have to be executed manually.

    tw learn v d textweiser.sqlt c Sales projects_3.txt
    Processing projects_3.txt OU
46. 4.4 Function Reference

All of Textweiser's functions and data structures are defined in the header file tw.h. The header has to be included in all applications that make use of the following functions.

Two example applications for Textweiser's main functions are included in this manual (see appendix A and B on pages 36 and 38) and in the software distribution.

4.4.1 tw_add_category and tw_delete_category

    tw_errno_t tw_add_category(tw_t *tw, const char *name);
    tw_errno_t tw_delete_category(tw_t *tw, const char *name);

tw_add_category adds a new category to a Textweiser database. tw_delete_category deletes an existing category and all its data.

Both functions take a pointer to an initialized Textweiser object tw_t as a first argument, see chapter 4.3.1 on page 16. The second argument is the name of the category to add or delete. The category's name has to be encoded in UTF-8 and must not exceed a length of 255 bytes. For hints on using mono-hierarchical category structures, please refer to chapter 3.1.2 on page 8.

The functions return an error code that indicates whether the respective function succeeded or an error occurred. For details on error handling, see chapter 4.5 on page 26.

Both functions are thread-safe and can thus be used by more than one thread at a time.

tw_optimize_db should be used to update the database.

Deleting a category cannot be reverted. After deleting a cate