Report - Microsoft Research

1. [Figure residue: Decoder class diagram. Recoverable classes: filemanager (newTempFile, checkTempFile, delTempFile, makePerm), frame (framelength: unsigned long int), repositorymanager (cluster_centers: vector<vector<double>>, repositoryname: string, repositorypath: string; makeNewFileName, addMFCC, getClusterCenter, selectRepository), codefile (path: string, emailid: string, codestream: unsigned int, curloc: unsigned long int; append, read, reademailid).] Speech coding using personalized speech repository 9
4.2 SEQUENCE DIAGRAMS
[Figure residue: sequence diagram. Recoverable calls: addtorepository, makeNewFileName, addMFCC, getFramePtr, makePerm, checkTempFile, delTempFile (if the file is temporary), commit (if the file is not temporary), toMono.]
2. s obtained
[Table residue: two test rows are recoverable. Test 1: 20, values 0 to 11, 10000 milliseconds, 6 14 MB, 486 minutes. Test 2: 20, values 0 to 11, 13000 milliseconds, 6 14 MB, 636 minutes. Recoverable column headings: Message, Parameters, Using Repository, Where is the message from (out of repository or in repository), Length of message file, Length of coded file.]
8 PROJECT TIMELINE
[Gantt chart residue; the time axis starts at June 04. Tasks in order: Obtain Approval, Problem Definition, Analysis, Study of earlier systems, Class Identification, Usecase Analysis, Analysis Review, Object modeling, Behavioral modeling, Design Review, Design Modifications, Revised Design review, Alpha Implementation, Testing, Review results, Optimize system parameters, Beta Implementation, Testing, Review results, Generate final report, Submit project with report.]
9 TASK DISTRIBUTION
Mumbai University recommends a group of 2 to 5 for the project work for the IV year BE projects. We formed a group of 3. After understanding the project, we realized from the statement of the problem that it basically contains 3 modules: 1. Repository generator, 2. Encoder, 3. Decoder. On further analysis, this time aimed specifically at each module, we soon realized that all the module
3. uniquely identifies repository and also system path of the directory under which all the repositories are stored
  private vector<unsigned int> codestream: buffer to be emptied into the codefile
  private unsigned long int curloc: current location inside the codestream
Member Functions (visibility, return type, name, parameters: description)
  public codefile(string path, string email_id, unsigned long int size): constructor to be called by the encoder module
  public codefile(string path): constructor to be called by the decoder module
  public int append(unsigned int code): appends the code specified as parameter to this codefile
  public unsigned int read(void): gets all the codes in this codefile into the codestream
  public unsigned int getcode(void): returns the next code from the codestream; returns END_OF_CODESTREAM when the end of the codestream is reached
  public int distance(vector<double> mfcc): calculates the distance between the current data point passed as parameter and the cluster centroids
  public string reademailid(void): returns the emailid value embedded in this codefile
  public ~codefile(void): destructor
repositorymanager.cpp
This class is used to manage a single repository, which is the output of the repository generator and is used by both the encoder and decoder. T
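The getcode contract in the table above (return the next code, then END_OF_CODESTREAM once the stream is exhausted) can be sketched as follows. This is a minimal in-memory illustration only: the class name CodeStream is hypothetical and is not the report's codefile class, no file I/O or email id handling is shown, and the sentinel value 0xFFFFFFFF is taken from the parameter table elsewhere in the report.

```cpp
#include <cstddef>
#include <vector>

// Sentinel returned when no codes remain (value as listed in the parameter table).
const unsigned int END_OF_CODESTREAM = 0xFFFFFFFF;

// Hypothetical in-memory stand-in for the codestream buffer and its cursor.
class CodeStream {
public:
    // Appends one code to the end of the stream.
    void append(unsigned int code) { codes_.push_back(code); }

    // Returns the next unread code, or END_OF_CODESTREAM when all codes are consumed.
    unsigned int getcode() {
        if (curloc_ >= codes_.size()) return END_OF_CODESTREAM;
        return codes_[curloc_++];
    }

private:
    std::vector<unsigned int> codes_;  // buffered codes
    std::size_t curloc_ = 0;           // current location inside the stream
};
```

A decoder loop built on this sketch would call getcode repeatedly and stop at the sentinel, which is why the sentinel must never be a valid cluster number.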
4. frames from the repository
Function: Receive the encoded file. Get the individual codes. Select the repository. Get the frames. Concatenate them. Apply smoothing algorithms.
Subordinates: Wavefile, File Manager, Frame, Repository Manager.
Dependencies: This phase should be done after the repository generation and the encoder phases.
Interfaces: vox -d encoded_file decoded_file. Input and output restrictions have been mentioned earlier.
Resources: Internet connection to receive the encoded file via email. Memory requirements, CPU requirements, I/O channels and system services.
Processing: The decoding will be done using the encoded file and the repository, i.e. NO_OF_CLUSTERS representative sound slices. The resultant audio will be created by successively concatenating the representative sound samples indicated in the encoded file. Smoothing will improve the quality of the resulting decoded sample. The receiver will pass the attached file through the decoder module, which will reproduce the original speech. The encoded message has to be stored. Finally, the decoded message file is obtained.
clustermanager.cpp
This class is responsible for identifying representative frames corresponding to the cluster centers obtained by performing k-means clustering on the training data set or on the message file.
Data members: Visibility Datatype Vari
5. [Figure residue: Encoder sequence diagram. Recoverable calls: getallclustercenters, getFirstFrame, getMFCC, checkTempFile, delTempFile, minimum, getNextFrame (until end of file).]
[Figure residue: Decoder sequence diagram. Lifelines: repositorymanager, filemanager, wavefile, codefile. Recoverable calls: selectRepository, reademailid (if the repository is not found, go to locationB), newTempFile, getFileName, appendFrame (while not at end of code), makePerm.]
5 IMPLEMENTATION DETAILS
Detailed description of components
The v
6. [Figure residue: class diagram. Recoverable classes and members:
  repositorymanager: repositoryname (string), repositorypath (string); makeNewFileName, addMFCC, getClusterCenter, selectRepository.
  filemanager: list (vector<string>); newTempFile, checkTempFile, delTempFile, makePerm.
  wavefile: ChunkID (char[4], "RIFF"), ChunkSize (unsigned long int), Format (char[4], "WAVE"), Subchunk1ID (char[4], "fmt "), Subchunk1Size (unsigned long int), AudioFormat (unsigned short int), NumChannels (unsigned short int), SampleRate (unsigned long int), ByteRate (unsigned long int), BlockAlign (unsigned short int), BitsPerSample (unsigned short int), Subchunk2ID (char[4], "data"), Subchunk2Size (unsigned long int), data (char), path (string), curloc (unsigned long int), locinc (unsigned long int); getFirstFrame, getNextFrame, appendFrame, commit, toMono.
  frame: framelength (unsigned long int).
  clustermanager: current (unsigned long int), dist (vector<double>), cluster_centers (vector<vector<double>>), fmtbl (framemfcctable), indices (vector<unsigned long int>); init_centroids.
  codefile.]
7. [Screenshot residue: GUI panels. Generate Repository: the status bar reads "Repository generated successfully". Encode message: fields Message file, Your Email id and Output file, each file field with a Browse button; status "Encoding done successfully". Decode Message: fields Input file and Output file with Browse buttons; status "Decoding done successfully".]
Some of the most frequently asked questions
Q: The program does not compile.
A: Are all the source files together in a directory? If not, put them together and then try. Do you have the privilege to create or modify directories? If not, the program will not compile or will not run properly. Consult your system administrator (root) about this problem.
Q: I am unable to run the program.
A: The program may take a long time to finish. This is particularly true when you are creating a repository. It may even happen during the encoding or decoding phase.
Q: I get errors about MFCC stuff.
A: Do you have sig2fv in the working directory of vox? If not, put it there. Is sig2fv executable? If not, chmod it to 700. If you are getting errors about libtermcap or something like that, install that library; sig2fv depends on it.
Q: The repository generator is not working.
A: The program may take a lo
8. a sample that is closest to the centroid will be chosen as the representative. These 10000 representative sound samples will then be assigned unique codes (the cluster numbers have been used as the codes). This collection of representative sounds and their codes will be the repository, using which other sound samples can now be encoded. Both the encoder and decoder will use a repository of speech segments. The repository may be transported by CDs or may be made available for download etc.
Purpose: To create a repository that represents the phonetically balanced characteristics of the particular user.
Inputs: A speech file recorded in the wav format, of at least 20 min duration. Speech should be mono and uncompressed. Input should be sampled at 11025 Hz with 8 bits per sample (Microsoft standard for telephone-quality speech).
Outputs: A speech repository (frame files characterizing the speech features of the user) that is created automatically on the user's system. This repository should then be made publicly available by the creator. Repository size is around 2 MB for each user. Every repository consists of an empirically decided 10000 representative frames and the codebook, which associates the frames with their corresponding parameters. The repository generator stores the repository in a directory named as per the user's email id. Error messages have been handl
9. in the earlier OOP languages, and it makes the creation of libraries much cleaner. Overloading allows a method to be declared several times with different parameters. C++ retains aspects of the C programming language, yet has features which simplify memory management. Additionally, some features of C++ allow low-level access to memory, while others are high level. C++ can largely be considered a superset of C: most C programs will compile with a C++ compiler. C uses structured programming concepts and techniques, while C++ adds object-oriented programming and classes, which focus on data. C++ places class declarations in header files and method bodies in source files. By declaring instances of classes you can reuse a set of variables and methods without having to define them again. Memory management is unchanged. Classes inherit from one another and share their methods.
6.5 Makefiles
We need a file called a makefile to tell make what to do. Most often, the makefile tells make how to compile and link a program.
6.6 Edinburgh Speech Tools
The Edinburgh Speech Tools Library is a library of general speech software written at the Centre for Speech Technology Research at the University of Edinburgh. The library is written in C++ and provides a range of tools for common tasks found in speech processing. The library provides a set of stand-alone executable programs and a set of library calls which
10. the user's email id for identification purposes. Error messages have been handled by standard C++ handling mechanisms such as try, throw, catch etc.
3 Decoding
The decoding will be done using the encoded file and the repository, i.e. 10000 representative sound slices. The resultant audio will be created by successively concatenating the representative sound samples indicated in the encoded file. Smoothing will improve the quality of the resulting decoded sample. The receiver will pass the attached file through the decoder module, which will reproduce the original speech.
Purpose: The decoder tool will take a code file and convert it into a decoded speech file formed by concatenating representative frames from the repository.
Inputs: An encoded speech file which has been encoded using this software itself. The code file should have been created using the repository that is present at the decoder end, i.e. the user should possess the repository of the sender. If it is not available, the user can obtain it.
Outputs: A wav file that the user can listen to. Error messages have been handled by standard C++ handling mechanisms such as try, throw, catch etc.
4.1 CLASS DIAGRAMS
[Figure residue: wavefile class box listing ChunkID (char[4], "RIFF"), ChunkSize (unsigned long int), Format (char[4], "WAVE"), Subchunk1ID (char[4], "fmt "), Subchunk1Size (unsigned long int), AudioFormat (unsigned short int), NumChannels (unsigned s
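The decoding step described in this excerpt (look up each code in the repository and concatenate the representative frames) can be sketched as below. This is an illustration under stated assumptions, not the project's decoder: the function name decodeCodes and the Frame alias are hypothetical, frames are assumed to be held in memory as 16-bit samples indexed by code, and the smoothing pass mentioned in the text is deliberately left out.

```cpp
#include <cstdint>
#include <stdexcept>
#include <vector>

// Assumption: one representative frame is a run of 16-bit PCM samples.
using Frame = std::vector<std::int16_t>;

// Rebuilds the sample stream by appending, for every code in order,
// the representative frame stored at that index in the repository.
std::vector<std::int16_t> decodeCodes(const std::vector<unsigned int>& codes,
                                      const std::vector<Frame>& repository) {
    std::vector<std::int16_t> out;
    for (unsigned int code : codes) {
        // A code outside the repository means sender and receiver repositories differ.
        if (code >= repository.size()) {
            throw std::out_of_range("code not present in repository");
        }
        const Frame& f = repository[code];
        out.insert(out.end(), f.begin(), f.end());
    }
    return out;
}
```

Throwing on an unknown code mirrors the report's stated use of try, throw and catch for error handling; a real decoder would also need the receiver to hold the sender's repository, as the Inputs paragraph requires.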
11. Speech coding using personalized speech repository
Index
1 Introduction and motivation
2 Problem statement
3 Requirement analysis
4 Project Design
5 Implementation Details
6 Technologies used
7 Test cases
8 Project Timeline
9 Task Distribution
10 References
11 Appendix
1 INTRODUCTION & MOTIVATION
The project deals with the idea of achieving compression by coding a person's speech using digital signal processing, clustering and vector quantization algorithms. People download a lot of audio and video over the Internet. It generally takes a long time to download audio speeches, e.g. news spoken by a news reader, commentary of a particular match which created some history in the concerned sport, the budget presentation by the Finance Minister, important messages by the President for the general public etc. To reduce the time required to download these large files, our project focuses on speech compression by speech coding. Since the process has to be carried out individually for every person, the title uses the term "personalized". This work is based on the intuition that in a speech sample of a particular person, similar elementary sounds are repeated. For example, when a person says "cricket" and "club", the initial "kk" sound in both words
12. -r: repository generation
  -e: encoding
  -d: decoding
filename: path of the input file
module_specific_options: if option -r then emailid; if option -e then output_filename; if option -d then output_filename
DESCRIPTION
A system for exchanging voice messages over mail using very high speech compression. The sender can record his voice message and transform it into the coded compressed file using the encoder module. The coded file can be transferred as an email attachment. The receiver may then pass the attached file through the decoder module, which reproduces the original speech. Both the encoder and decoder use a repository of speech segments generated using the repository generator module.
Parameters (name, typical value: description)
  SUCCESS, 1: denotes successful completion of the routine
  FAILURE, 0: denotes failure in the routine due to some error
  END_OF_CODESTREAM, 0xFFFFFFFF: denotes the end of the code file
  REP_PATH, repositories: path of the directory where the repositories are stored
  CODEBOOK, rep_file.bin: name of the codefile
  MAXPATH, 256: maximum size of the path
  voxtemppath, tmp: denotes the directory name where the temporary files are stored
  FRAMELENGTH, 0.02: denotes the length of the frame in seconds
  SAMPLERATE, 8000: denotes the sampling rate in samples per second
  BPS, 16: Denotes the nu
13. able name Description
  private long int Current: current cluster number being processed
  private vector<double> Dist: distance of each cluster center from the current data point
  private vector<double> Centroid: MFCC parameters of a particular cluster center
  private vector<vector<double>> cluster_centers: centers of the clusters
  private vector<unsigned long int> Indices: indices of frames in the mfcc table to be added to the repository
  private vector<int> Count: current count of members in each cluster
  public framemfcctable Fmtbl: stores the mfcc values for all the frames
Member Functions (visibility, return type, name, parameters: description)
  public clustermanager(void): constructor for the clustermanager class
  public void showcenters(void): displays all the cluster centers
  public int initcentroids(int iter): initializes cluster centroids by randomly selecting tuples from the mfcc table
  public int Start(void): initiates the clustering algorithm
  public int Distance(void): calculates the distance between the current data point taken from the mfcc table and the cluster centroids
  public int distance(vector<double> mfcc): calculates the distance between the current data point passed as parameter and the cluster centroids
  public int minimum(void): Finds the minimum distance of current frame from all oth
14. arious components used in the modules as shown above are listed below module wise.
repository generator cpp
Identification: Repository generator.
Purpose: To create a repository that represents the phonetically balanced characteristics of the particular user.
Function: Split into frames. Find MFCC parameters. Perform clustering. Prepare repository.
Subordinates: Cluster Manager, Frame MFCC Table, Repository Manager.
Dependencies: This phase should be done before encoding and decoding.
Interfaces: vox -r training_file emailid. It interfaces with the Edinburgh Speech Tools sig2fv functionality to get the MFCC parameters. Input and output restrictions have been mentioned earlier.
Resources: Internet connection to publish the repository. Heavy memory requirements, CPU requirements, I/O channels (CD writers to publish the repository), libraries and system services.
Processing: A recorded lecture will be obtained. All experiments will be conducted using this sample (sampling rate 8000 Hz, single channel, 16 bits per sample). A feature-balanced sample of a few minutes' duration will be used for repository generation. This file will be divided into a number of files of FRAMELENGTH duration each. MAX_DIM MFCC features (Mel-frequency cepstral coefficients) will be computed for each of these sound slices. MFCC features are perception-based features which are widely used in speech recognition. It is assumed t
15. as an email attachment. The receiver passes the attached file through the decoder module, which reproduces the original speech. Both the encoder and decoder will use a repository of speech segments. This repository will be fairly large and may need to be transported by CDs etc. The entire system (encoder, decoder and repository generator) needs to be prepared and coded for Linux. The project should deliver an easy-to-use package (it may be a set of command-line tools) which will enable the proposed exchange of voice messages. The encoder tool should just take a sound file, maybe in the WAV format, and convert it into a compressed binary file. The decoder tool does the opposite job. The repository generator tool works on a large sample of speech to generate the corpus.
3 REQUIREMENT ANALYSIS
3.1 Introduction
The project involves building a system for exchanging voice messages over mail using very high speech compression, as described above. The sender will record his voice message and transform it into the coded compressed file using the encoder module. The coded file is transferred as an email attachment. The receiver passes the attached file through the decoder module, which reproduces the original speech. Both the encoder and decoder will use a repository of speech segments. The repository may be transported by CDs or may be made available for download etc. The entire system enc
16. cal toolkit. Tcl and the Tk toolkit comprise one of the earliest scripted programming environments for the X Window System. Though it is venerable by today's standards, Tcl/Tk remains a handy tool for developers and administrators who want to rapidly build graphical frontends for command-line utilities. Tcl and Tk come bundled with most major Linux distributions. If they are not installed on your system, source releases are available from the SourceForge Tcl project (http://tcl.sourceforge.net), binary builds for most Linux distributions are available from rpmfind.net, and a binary release for Linux and other platforms is available from ActiveState at http://aspn.activestate.com/ASPN/Tcl.
Tcl is built up from commands which act on data and which accept a number of options that specify how each command is executed. Each command consists of the name of the command followed by one or more words separated by whitespace. Because Tcl is interpreted, it can be run interactively through its shell command, tclsh, or non-interactively as a script. When Tcl is run interactively, the system responds to each command as it is entered. You can experiment with tclsh by simply opening a terminal and entering the command tclsh. Tcl's windowing shell, Wish, is an interpreter that reads commands fro
17. can be linked into user programs.
sig2fv: Generate signal processing coefficients from waveforms
sig2fv performs signal-processing feature-vector analysis on speech waveforms. The following types of analysis are provided:
- Linear prediction (LPC)
- Cepstrum coding from lpc coefficients
- Mel scale cepstrum coding via fbank
- Mel scale log filter bank analysis
- Line spectral frequencies
- Linear prediction reflection coefficients
- Root mean square energy
- Power
- Fundamental frequency (pitch)
6.7 Tcl/Tk (Tool Command Language)
The Tcl language and Tk graphical toolkit are simple and powerful building blocks for custom applications. The Tcl/Tk combination is increasingly popular because it lets you produce sophisticated graphical interfaces with a few easy commands, develop and change scripts quickly, and conveniently tie together existing utilities or programming libraries. One of the attractive features of Tcl/Tk is the wide variety of commands, many offering a wealth of options. Most of the things you'd like to do have been anticipated by the language's creator, John Ousterhout, or one of the developers of Tcl/Tk's many powerful extensions. Thus you'll find that a command or option probably exists to provide just what you need. The tool command language Tcl (pronounced "tickle") is an interpreted, action-oriented, string-based command language. It was created by John Ousterhout in the late 1980s along with the Tk graphi
18. cess is over, communication can begin almost instantly. The following are the most prominent advantages of this system:
- Efficient bandwidth usage: since only codes are transmitted and not actual speech, the system uses very little bandwidth and is fast and cost effective.
- Clarity of communication: expression and understanding of emotions are better in voice communication.
- Usable as a shared library.
- Easy-to-use package.
Applications
- News broadcast and archival: consider the audio news downloads which appear on news websites. These news items are typically read out by one person or a small group of persons. The actual news audio samples can be encoded based on the profile. The users will only need to download the encoded data. This can be decoded using the profile stored earlier by the user, and the audio can be regenerated.
- Streaming and audio conferencing: instead of communication via e-mail, this system can act as a phone so that two people can communicate in real time. Extending this idea further, multicasting will help in creating a virtual conference wherein the voice of the speaker is made audible to the entire audience.
For more information visit http://vox.sf.net
Hardware Requirements
Linux-compatible machine (Pentium etc.; recommended: Pentium III or equivalent), soundcard, keyboard, monitor, speakers, microphone (Not essential but Recommend
19. d line based tool we created the main file vox.cpp, which presented the user with the desired module of the available three. Finally, to implement a GUI for our tool, we used Tk. Once we had a working tool, we tested the system with different parameters, which we had carefully isolated in parameters.cpp. We studied various test cases, both those provided by our guide and those generated by us, to improve the quality of the tool by deciding upon appropriate parameter values.
10 REFERENCES
- Ki-Seung Lee and Richard V. Cox, "A very low bit rate speech coder based on a recognition/synthesis paradigm", IEEE Transactions on Speech and Audio Processing, 2001
- Suresh Balakrishna, "Speech Recognition using Mel Cepstrum features", Mississippi State University, 1998
- http://www.it.iitb.ac.in/~chetanv
- http://www.speex.org
- http://www.elet.polimi.it/upload/matteucc/Clustering/tutorial_html/kmeans.html
- http://www.festvox.org
- http://www.sourceforge.org
- http://www.opensource.org
- http://www.psytechnics.com/downloads/2001-P02.pdf
- http://www.pesq.org
- www.tcl.tk
11 APPENDIX
User manual
VoX is an acronym for Voice eXchange. VoX is a nifty command-line and GUI-based tool that is used to encode speech files using a repository. Sample passages can be used to generate a good training file. This w
20. e STL's generic algorithms work on native C++ data structures such as strings and vectors. STL containers are very close to the efficiency of hand-coded, type-specific containers.
Advantages of the STL:
- You don't have to write your own classes and algorithms, which saves time.
- You don't have to worry about allocating and freeing memory, which is a big problem when you create your own linked list, queue or other classes.
- It reduces your code size, because the STL uses templates to develop these classes.
- Without the STL you have to overload your functions or classes to operate on different types of data, while the STL lets you apply the same classes to different kinds of data.
- Easy to use and easy to learn.
6.3 Emacs
For programming on the CSE Unix system. Emacs features are as follows:
- Source code coloring
- Automatic indentation
- Line numbers
- Split screen compilation
- Automatic line wrapping
- Automatic backups
- Free Windows version
6.4 C++ under LINUX
C++ is an object-oriented programming language created by Bjarne Stroustrup and released in 1985. It implements data abstraction using a concept called classes, along with other features that allow object-oriented programming. Parts of a C++ program are easily reusable and extensible; existing code can be extended without actually having to change it. C++ adds a concept called operator overloading not seen
21. ed), Internet connection (not essential but recommended), RAM (at least 256 MB recommended), secondary storage (hard disc > 5 GB), CD-RW drive (if Internet is not available), CD-RWs.
Software Requirements
- Operating system: Linux
- Playback software that supports uncompressed wavefiles at 8000 Hz, mono channel, 8 bits per sample
- Recording software (not essential but recommended) that supports uncompressed wavefiles at 8000 Hz, mono channel, 8 bits per sample
- CD-RW software, if a CD-RW drive is present
- Web browser and e-mail client
The project will be independent of all these:
- speech recording software and hardware
- e-mail software and communication network
- sound reproduction software and hardware
22. ed by standard C++ handling mechanisms such as try, throw, catch etc.
2 Encoding
A new 10 second sample will be taken and divided into 20 ms slices. MFCC features will be extracted from the 500 sound slices created this way. Each of these feature vectors will be taken, and a closest match will be found from the 10000 feature vectors of the representative samples of the profile. This will be done by determining the minimum Euclidean distance in the 12-dimensional feature space. Thus, for each of the 500 sound slices, a representative sound from the profile will be identified. The encoded file will consist of this sequence of codes of the representative sound samples. The sender will record his voice message and transform it into the coded compressed file using the encoder module. The coded file will be transferred as an email attachment.
Purpose: The encoder tool will take a sound file and convert it into a compressed binary file using the repository.
Inputs: A speech file recorded in the wav format. Speech should be mono and uncompressed. Input should be sampled at 11025 Hz with 8 bits per sample (Microsoft standard for telephone-quality speech).
Outputs: The code file that has to be transmitted over the internet to the receiver. For an input file of 10 sec duration, an output code file of around 2.2 KB will be generated. This codefile will also contain
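The matching step described above (for each slice, find the representative with the minimum Euclidean distance in feature space) can be sketched as follows. This is a minimal illustration, not the project's clustermanager code: the names squaredDistance, nearestCode and encodeFrames are hypothetical, the cluster index is used as the code as the text describes, and squared distances are compared because they have the same minimum as Euclidean distances.

```cpp
#include <cstddef>
#include <limits>
#include <vector>

// Squared Euclidean distance between two feature vectors of equal length.
double squaredDistance(const std::vector<double>& a, const std::vector<double>& b) {
    double sum = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        const double d = a[i] - b[i];
        sum += d * d;
    }
    return sum;
}

// Returns the index of the cluster center nearest to mfcc; that index is the code.
// Assumes centers is non-empty and all vectors share one dimension (12 in the text).
unsigned int nearestCode(const std::vector<double>& mfcc,
                         const std::vector<std::vector<double>>& centers) {
    unsigned int best = 0;
    double bestDist = std::numeric_limits<double>::max();
    for (std::size_t c = 0; c < centers.size(); ++c) {
        const double d = squaredDistance(mfcc, centers[c]);
        if (d < bestDist) {
            bestDist = d;
            best = static_cast<unsigned int>(c);
        }
    }
    return best;
}

// Encodes a message as one code per frame, in frame order.
std::vector<unsigned int> encodeFrames(const std::vector<std::vector<double>>& frames,
                                       const std::vector<std::vector<double>>& centers) {
    std::vector<unsigned int> codes;
    codes.reserve(frames.size());
    for (const auto& f : frames) {
        codes.push_back(nearestCode(f, centers));
    }
    return codes;
}
```

As a rough size check, and assuming each code is stored as a 4-byte unsigned int (the report does not state the on-disk width), the 500 codes of a 10 second message occupy 2000 bytes, which is in line with the quoted code file size of around 2.2 KB once the embedded email id is added.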
23. efile& wv: copy constructor; calls wavefile(wv)
  public frame(const wavefile& wv): copy constructor; calls wavefile(wv)
  public frame(frame& wv): copy constructor; calls wavefile(wv)
  public ~frame(void): destructor
framemfcctable.cpp
This class is responsible for populating the mfcc table and retrieving the mfcc parameters for the current frame from it.
Data members (visibility, datatype, variable name: description)
  private vector<vector<double>> mfcctable: stores the 12 mfcc parameters for each frame
Member functions (visibility, return type, name, parameters: description)
  public framemfcctable(void): constructor (implicit constructor)
  public int addFrame(vector<double> mfcc): adds the mfcc parameters for the current frame into the mfcc table
  public vector<double> getFrameMFCC(int i, int status): gets the mfcc parameters for the current frame from the mfcc table
  public unsigned long int nFrames(wavefile& wv): returns the size of the mfcc table
codefile.cpp
This class represents the codefile that is the output of the encoder and is used as an input to the decoder. It is responsible for holding the emailid and codes, and for the operations on these data members.
Data members (visibility, datatype, variable name: description)
  private string Emailid: name of the repository directory
  private string Path
24. er cluster centroids
  public int recalculate(int min): recalculates the new cluster centroid after the current frame has been added to the cluster
  public vector<unsigned long int> getIndices(void): gets the indices, in the mfcc table, of the mfcc parameters of the representative cluster centroids
  public vector<vector<double>> getcentroids(void): gets the mfcc values of the representative cluster centroids
  public int getallclustercenters(string email): gets the cluster centers from the codebook which is being managed by repositorymanager
  public unsigned int compare(vector<double> mfcc): combines the functionality of distance and minimum to find the representative for the frame passed as the parameter
wavefile.cpp
This class is responsible for representing the wavefile and performing operations related to it, such as creation, getting MFCC parameters, breaking a wavefile into frames and making a wavefile from constituent frames.
Data members (visibility, datatype, variable name: description)
  protected char[4] ChunkID: contains the letters "RIFF" in ASCII form (0x52494646 big-endian form)
  protected unsigned long int ChunkSize: 36 + SubChunk2Size, or more precisely 4 + (8 + SubChunk1Size) + (8 + SubChunk2Size). This is the size of the rest of the chunk following this number, i.e. the size of the entire file in bytes minus 8 bytes for the two fields not included in this count: ChunkID and ChunkSize.
25. protected char[4] Format: contains the letters "WAVE" (0x57415645 big-endian form)
  protected char[4] Subchunk1ID: contains the letters "fmt " (0x666d7420 big-endian form)
  protected unsigned long int Subchunk1Size: 16 for PCM. This is the size of the rest of the subchunk which follows this number.
  protected unsigned short int AudioFormat: PCM = 1 (i.e. linear quantization). Values other than 1 indicate some form of compression.
  protected unsigned short int NumChannels: Mono = 1, Stereo = 2
  protected unsigned long int SampleRate: 8000, 44100, etc.
  protected unsigned long int ByteRate: SampleRate * NumChannels * BitsPerSample / 8
  protected unsigned short int BlockAlign: NumChannels * BitsPerSample / 8. The number of bytes for one sample including all channels.
  protected unsigned short int BitsPerSample: 8 bits = 8, 16 bits = 16, etc.
  protected char[4] Subchunk2ID: contains the letters "data" (0x64617461 big-endian form)
  protected unsigned long int Subchunk2Size: NumSamples * NumChannels * BitsPerSample / 8. This is the number of bytes in the data. You can also think of this as the size of the rest of the subchunk following this number.
  protected char Data: the actual sound data
  protected string Path: location of the open wavefile
  protected unsigned long int locinc: Subchunk2Size of each frame
  protected unsigned long int Curl
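The derived header fields in the table above follow directly from the sample rate, channel count and bit depth. The sketch below works them out for one 20 ms frame; the struct WavHeader and its method names are hypothetical helpers for illustration and are not the report's wavefile class, and the example values (8000 Hz, mono, 16 bits, 0.02 s frames) are the typical values quoted in the report's parameter table.

```cpp
#include <cstdint>

// Hypothetical helper holding the independent PCM header fields.
struct WavHeader {
    std::uint16_t numChannels;    // Mono = 1, Stereo = 2
    std::uint32_t sampleRate;     // samples per second, e.g. 8000
    std::uint16_t bitsPerSample;  // 8 or 16
    std::uint32_t numSamples;     // samples per channel in the data subchunk

    // ByteRate = SampleRate * NumChannels * BitsPerSample / 8
    std::uint32_t byteRate() const {
        return sampleRate * numChannels * bitsPerSample / 8;
    }
    // BlockAlign = NumChannels * BitsPerSample / 8
    std::uint16_t blockAlign() const {
        return static_cast<std::uint16_t>(numChannels * bitsPerSample / 8);
    }
    // Subchunk2Size = NumSamples * NumChannels * BitsPerSample / 8
    std::uint32_t subchunk2Size() const {
        return numSamples * numChannels * bitsPerSample / 8;
    }
    // ChunkSize = 36 + Subchunk2Size for PCM (Subchunk1Size = 16)
    std::uint32_t chunkSize() const {
        return 36 + subchunk2Size();
    }
};
```

With a 0.02 s frame at 8000 Hz, a frame holds 160 samples, so for mono 16-bit audio the data subchunk of a single-frame file is 320 bytes and ChunkSize is 356.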
26. hat 10000 (NO_OF_CLUSTERS) different elementary sounds will be enough to characterize the range of sounds produced by a person. This number will be arrived at empirically. The sound samples will then be clustered into NO_OF_CLUSTERS clusters based on their MFCC features. A variant of the k-means algorithm will be used for clustering. For each of these clusters, the sample that is closest to the centroid will be chosen as the representative. These NO_OF_CLUSTERS representative sound samples will then be assigned unique codes (the cluster numbers have been used as the codes). This collection of representative sounds and their codes will be the repository with which other sound samples can be encoded. Both the encoder and the decoder will use a repository of speech segments. The repository may be transported on CDs, made available for download, etc.
Data: Repository containing the frames and a codebook.

encoder.cpp
The encoder tool will take a sound file and convert it into a compressed binary file using the repository:
Split into frames
Find MFCC parameters
Perform vector quantization
Prepare encoded message
Subordinates: Wavefile, File Manager, Frame, Cluster Manager, Frame MFCC Table, Repository Manager, Code File.
Dependencies: This phase should be done before decoding and after the repository for the concerned person has been generated.
Func
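The repository description in this excerpt picks, for each cluster, the sample closest to the centroid as its representative. The sketch below illustrates that one step only, not the report's k-means variant or its clustermanager class. The names `Feature`, `centroid` and `representative_index` are hypothetical, and the cluster members are assumed to be already assigned.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Feature = std::vector<double>;  // one MFCC vector (MAX_DIM values in the report)

// Squared Euclidean distance; it preserves the ordering of the true distance.
double squared_distance(const Feature& a, const Feature& b) {
    assert(a.size() == b.size());
    double sum = 0.0;
    for (std::size_t d = 0; d < a.size(); ++d) {
        const double diff = a[d] - b[d];
        sum += diff * diff;
    }
    return sum;
}

// Component-wise mean of all members of one cluster.
Feature centroid(const std::vector<Feature>& members) {
    assert(!members.empty());
    Feature mean(members.front().size(), 0.0);
    for (const Feature& m : members) {
        assert(m.size() == mean.size());
        for (std::size_t d = 0; d < mean.size(); ++d) mean[d] += m[d];
    }
    for (double& v : mean) v /= static_cast<double>(members.size());
    return mean;
}

// Index of the member closest to the centroid; ties go to the earliest member.
std::size_t representative_index(const std::vector<Feature>& members) {
    const Feature c = centroid(members);
    std::size_t best = 0;
    double best_dist = squared_distance(members[0], c);
    for (std::size_t i = 1; i < members.size(); ++i) {
        const double dist = squared_distance(members[i], c);
        if (dist < best_dist) {
            best_dist = dist;
            best = i;
        }
    }
    return best;
}
```

Choosing an actual member rather than the centroid itself means the repository can store a real recorded frame for every code.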
27. he repository is identified by the email id and is stored as a directory containing a codebook and the representative frames.

Data members
Visibility | Datatype | Variable name | Description
private | string | Emailid | Uniquely identifies the repository; also the name of the repository directory.
private | string | Path | System path of the directory under which all the repositories are stored.

Member functions
Visibility | Return type | Name | Parameters | Description
Public | | repositorymanager | void | Implicit constructor.
Public | | repositorymanager | string email_id, int create = NOCREATE | Creates a repository with the name specified by the parameter email_id when used in repository generation. Used to access the repository in the encoder and decoder phases.
Public | string | makeNewFileName | int i | Generates a new file name as specified by the email_id and the integer i.
Public | vector<double> | getClusterCenter | unsigned int i | Gets all the cluster centroids for this repository.
Public | int | addMFCC | vector<double> mfcc | Inserts the MFCC parameters of the cluster center into the codebook.
Public | string | getFrameName | unsigned int code | Gets the filename for the specified code.
Public | | ~repositorymanager | void | Destructor.

vox.cpp
NAME: vox (Voice eXchange)
SYNOPSIS: vox options filename module_specific_options
options
28. [Figure: class diagram fragment showing the wavefile and framemfcctable classes.]

4 PROJECT DESIGN
Repository Generator Class Diagram
[Figure: class diagram showing the filemanager, clustermanager, framemfcctable and repositorymanager classes.]
29. iles that have been created

Member functions
Visibility | Return type | Name | Parameters | Description
Public | | filemanager | void | Constructor. Implicit constructor.
Public | | ~filemanager | void | Destructor. Deletes all the temporary files.
Public | string | newTempFile | void | Provides a new unique name for the frame file.
Public | int | checkTempFile | string path | Checks whether the file is temporary. Returns 1 if temporary, else 0.
Public | int | delTempFile | string path | Deletes the file if it is found to be temporary.
Public | int | makePerm | string dest, string src | Makes the file permanent by renaming it if it is found to be non-temporary.

frame.cpp
This class represents frames and performs the same operations as the wavefile class. It publicly inherits from the wavefile class.
Data members: Inherited from the wavefile class.
Member functions: Except for the constructors, it inherits all the functionality of the wavefile class. Other member functions are as follows:
Visibility | Return type | Name | Parameters | Description
Public | | frame | void | Constructor. Implicit constructor. Calls wavefile().
Public | | frame | char frdata, unsigned long int frsize | Constructor. Calls wavefile(frdata, frsize).
Public | | frame | string wavepath | Constructor. Calls wavefile(wavepath).
Public | | frame | wav
30. ill eventually affect the creation of the repository. A good training file should be long and phonetically balanced. You may use open literature to generate the training file. Such literature is available at Project Gutenberg.

Some sample commands for the command line are:
To create a directory named vox in the root directory: mkdir vox
To copy the compressed file vox.tar.gz to the vox directory: cp vox.tar.gz vox
To change the directory: cd vox
To uncompress the compressed files: tar zxvf vox.tar.gz
To run the make file of the vox tool: make
To view the man page of the vox tool: vox
For the repositorygenerator module: vox r yourbigspeechfile.wav youremailid@somehost.somedomain
For the encoder module: vox e yourmessage.wav codedfile.bin youremailid@somehost.somedomain
For the decoder module: vox d codedfile.bin outputmessage.wav

Screenshots of the graphical interface are shown below.
As you start, you will see the following screen. Click one of the 3 buttons on the left-hand side to start the desired module.
[Screenshot: start screen with the Repository generator, Encoder, Decoder and Exit buttons.]
When you click the topmost button, the following window opens, in which you need to enter the appropriate input as shown.
[Screenshot: Repository Generator window with callouts for the module name, the training file field, the Browse file selector, the email id field and the Start execution button.]
31. [Figure: class diagram fragment showing the codefile, clustermanager, framemfcctable and repositorymanager classes.]

Encoder Class Diagram
[Figure: class diagram fragment showing the framemfcctable and wavefile classes.]
32. m standard input or from a file, interprets them using the Tcl language, and builds graphical components from the Tk toolkit. Like tclsh, it can be run interactively.

6.8 PESQ
PESQ stands for Perceptual Evaluation of Speech Quality and is an enhanced perceptual quality measurement for voice quality in telecommunications. It was specifically developed to be applicable to end-to-end voice quality testing under real network conditions such as VoIP, POTS, ISDN and GSM. It combines the time-alignment technique from PAMS (Perceptual Analysis Measurement System) with the accurate perceptual modeling of PSQM (Perceptual Speech Quality Measurement), taking the best features of each technique. It is applicable not only to speech codecs but also to end-to-end measurement.
Defined by ITU-T Recommendation P.862 in February 2001, PESQ has become the most widely accepted standard for measuring voice quality over VoIP networks. However, the use of PESQ is not limited to VoIP. It can be used effectively to test, for example, voice over frame relay (VoFR), voice over ATM (VoATM), wireless systems, and cable modem and DSL systems that carry speech. PESQ takes into account filtering in analog components, variable delay and coding distortion. It measures one-way quality and is designed for use with intrusive tests.
Meaning of PESQ Value
33. mber of the bits per sample
MAX_DIM | 12 | Total number of dimensions involved
k | 10000 | Number of clusters
VERY_HIGH_VALUE | 99999.99999 | Denotes a very high value
NO_OF_ITER | 6 | Number of iterations
CREATE | 1 | Denotes that a repository needs to be created
NOCREATE | 0 | Denotes that a repository need not be created, as it already exists

6 TECHNOLOGIES USED
6.1 Linux
Here are some of the benefits and features that Linux provides over single-user operating systems such as MS-DOS and over other versions of UNIX for the PC:
Full multitasking and 32-bit support
GNU software support
The X Window System
TCP/IP networking support
Virtual memory and shared libraries
Audio & Multimedia

6.2 STL
Alexander Stepanov began the work that led to the STL (Standard Template Library) in 1979 and later continued it at HP. He was joined by David Musser and Meng Lee. In 1994 the STL was accepted into the draft ANSI/ISO C++ standard. The STL provides general-purpose utility classes which programmers can use in their applications without having to worry about allocating and freeing memory. These include array, linked list, stack, string, vector, iterator and map classes. The STL also provides general algorithms to sort, search or reverse arrays or lists. Besides these two things, the STL provides iterators and other options that can be applied to these classes.
Features Th
34. ng time to finish. This is particularly true when you are creating a repository. It may even happen during the encoding or decoding phase.
Help! VoX is stuck.
The encoder is not working.
The decoder is not working.
The repository generator is not working.
The program may take a long time to finish. This is particularly true when you are creating a repository. It may even happen during the encoding or decoding phase.
For more information, visit http://vox.sf.net

Technical manual
VoX should work on any Linux/Unix box. VoX has been developed using g++ on Red Hat Linux. It has been tested on Red Hat Linux and Knoppix. VoX makes use of the sig2fv tool of the Edinburgh Speech Tools Library. You will have to compile it separately and place sig2fv in the working directory of VoX.
VoX is independent of:
speech recording software and hardware
e-mail software and communication network
sound reproduction software and hardware

Advantages of this system
The system will be user friendly. Once the repository generation and exchange pro
35. oc | Current frame start location
Protected | FILE | Fptr | Associated with the MFCC file

Member functions
Visibility | Return type | Name | Parameters | Description
Public | | wavefile | void | Constructor. Implicit constructor, used to create a wavefile with a header but no data.
Public | unsigned long int | getlocinc | void | Gets the location increment.
Public | | wavefile | char frdata, unsigned long int frsize | Constructor. Used to create a wavefile from the data passed as parameter.
Public | | wavefile | string wavepath | Constructor. Opens the file specified by the path and initialises all private variables; allocates a buffer for the data and copies the data.
Public | | wavefile | wavefile &wv | Copy constructor. Copies all private variables except for the path; reallocates the buffer for the data and copies the data.
Public | | wavefile | const wavefile &wv | Copy constructor. Copies all private variables except for the path; reallocates the buffer for the data and copies the data.
Public | int | makeMono | int type | Converts this wavefile to a single channel if it is multichannel. type = 1 => sum; type = 2 => average; return value 1 => clipping.
Public | int | makePerm | string dest | Makes a wave file permanent.
Public | int | getFirstFrame | char frmdata, unsigned long int frsize | Copies the first frame of locinc samples (padding is done in wf if needed) and returns the length of the frame.
Public | int | getNextFrame | char | Copies nex
36. oder, decoder and repository generator will be prepared and coded for Linux. The project will deliver an easy-to-use package which will enable the proposed exchange of voice messages.
- The repository generator tool works on a large sample of speech to generate the corpus, using clustering and Mel-frequency cepstral coefficient (MFCC) feature extraction.
- The encoder tool will take a sound file and convert it into a compressed binary file using the repository.
- The decoder tool does the opposite job.

3.2 Steps of the process
1. Repository generation
A recorded lecture will be obtained. All experiments will be conducted using this sample (sampling rate 11025 Hz, single channel, 8 bits per sample). A 15-minute sample will be extracted for repository generation. This file will be divided into 45000 files of 20 ms duration each. 12 MFCC features will be computed for each of these sound slices. MFCC features are perception-based features which are widely used in speech recognition. It is assumed that 10000 different elementary sounds will be enough to characterize the range of sounds produced by a person. This number will be arrived at empirically. The 45000 sound samples will then be clustered into 10000 clusters based on their MFCC features. A variant of the k-means algorithm will be used for clustering. For each of these clusters
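The figure of 45000 frames quoted in this excerpt follows directly from the 15-minute sample and the 20 ms frame length. A small worked sketch of the arithmetic is given below. The function names `frame_count` and `frame_bytes` are hypothetical. Rounding a trailing partial frame up to a whole frame is an assumption, suggested by the padding mentioned for the frame accessors, and truncating the per-frame byte count is likewise an assumption.

```cpp
#include <cassert>
#include <cstdint>

// Number of fixed-length frames needed to cover a recording.
// A trailing partial frame counts as one frame (assumed to be padded).
std::uint64_t frame_count(std::uint64_t duration_ms, std::uint64_t frame_ms) {
    assert(frame_ms > 0);
    return (duration_ms + frame_ms - 1) / frame_ms;
}

// Bytes of PCM data in one frame, truncated to a whole number of bytes.
std::uint64_t frame_bytes(std::uint64_t sample_rate_hz, std::uint64_t channels,
                          std::uint64_t bits_per_sample, std::uint64_t frame_ms) {
    assert(bits_per_sample % 8 == 0);
    return sample_rate_hz * channels * (bits_per_sample / 8) * frame_ms / 1000;
}
```

With the report's parameters, frame_count(15 * 60 * 1000, 20) gives 45000. At 11025 Hz, mono, 8 bits per sample, a 20 ms frame covers 220.5 samples, so frame_bytes returns 220 after truncation. How the report's own code treats that half sample is not stated in this excerpt.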
37. s
The PESQ score is mapped to a MOS-like scale, a single number in the range of -0.5 to 4.5, where values close to 4.5 indicate very good speech quality and values close to -0.5 indicate very bad speech quality. In most cases the output ranges between 1.0 and 4.5. A PESQ score of 2 and below corresponds to a degradation level at which speech is difficult to understand. Further mapping to MOS values is a fairly straightforward process.
A system that assesses the quality of speech must allow for the transmission of different voices. The source can be real or artificial speech. Input from real speech should be based on ITU-T P.830, and the use of at least two male and two female speakers is recommended. Artificial speech is recommended only if it can represent the temporal and phonetic structure of real speech signals. Test signals should include speech bursts separated by silent periods that are representative of natural pauses in speech. The typical duration of a speech burst is 1 to 3 seconds. PESQ can also be used to assess the quality of systems carrying speech in the presence of background or environment noise.

7 TEST CASES
Test case 1
Training File Parameters: Training file | Sampling | Compression type used | 15 Minutes
Repository Parameters (column headings, partly illegible in the source): Repository number | MFCC used | Number of clusters | Frame length | Number of iterations | Time required to generate repository
38. s depended on some basic classes of objects, e.g. the wavefile class, a class to handle clustering, classes to handle the repository and code files, etc. So we sat together and decided on the different classes to be developed or reused and their interactions in the various modules.
Apoorv then started off with the study and development of the wavefile class and its child class frame, to handle various operations on wav files. To handle multiple temporary frames, he also developed the filemanager class. He was also instrumental in identifying the tools that can be used for MFCC generation.
Manish was handed the responsibility for the clustering algorithm, with its time and memory efficiency considerations, and for the vector quantization to be used, implemented as the clustermanager class. He also worked on the implementation of the framemfcctable class, which is a part of clustermanager.
Sumeet was given the responsibility for the repositorymanager and codefile classes, which included considerations of how to represent the code files and the repository. He also put in extra effort testing the program at home and was instrumental in identifying some of the key parameters in system performance.
Finally, we decided to integrate our individual work to form 3 new classes that provide an abstraction interface between the user and these classes. The combined effort thus led to the development of the repositorygenerator, encoder and decoder classes. So as to create a complete comman
39. t frame of locinc samples (padding is done in wf if needed) and returns the length of the frame.
Public | int | getFrame | unsigned long int i, string framename | Copies the i-th frame of locinc samples (padding is done in wf if needed) and returns the length of the frame.
Public | vector<double> | getMFCC | int status | Gets all the parameters one by one from fptr.
Public | int | appendFrame | string fpath | Appends a frame to this wavefile without smoothing.
Public | int | getData | char frmdata, unsigned long int frsize | Gets the data for the current frame or wavefile, with the length indicated by frsize.
Public | int | commit | void | Copies all private variables and data back to the location specified by path.
Public | int | showDetails | void | Displays the header information of the wavefile.
Public | | ~wavefile | void | Destructor. Closes the file specified by the path, copies all private variables and data back, deallocates the buffer for the data and tries to delete the temporary file.
Public | unsigned long int | nFrames | void | Returns Subchunk2Size / getlocinc().

filemanager.cpp
This class handles the various frame file operations, such as creating frame files with unique names, deleting temporary ones and saving the permanent ones as the repository.
Visibility | Datatype | Variable name | Description
Protected | vector<string> | filelist | List of all the temporary f
40. tion
Interfaces: vox e speech file emailid output_file. It interfaces with the Edinburgh Speech Tools sig2fv functionality to get the MFCC parameters. Input and output restrictions have been mentioned earlier.
Resources: Internet connection to send the encoded file via email; heavy memory requirements; CPU requirements; I/O channels, libraries and system services.
Processing: A small message file to be encoded will be taken and divided into slices, each of the size given by FRAME_LENGTH. MFCC features will be extracted from the sound slices created this way. For each of these feature vectors, the closest match will be found among the NO_OF_CLUSTERS feature vectors of the representative samples of the profile. This will be done by determining the minimum Euclidean distance in the MAX_DIM-dimensional feature space. Thus, for each of the sound slices, a representative sound from the profile will be identified. The encoded file will consist of this sequence of codes of the representative sound samples. The sender will record his voice message and transform it into the coded compressed file using the encoder module. The coded file will be transferred as an email attachment.
Data: Encoded file to be transmitted as the message.

decoder.cpp
Decoder: a module.
The decoder tool will take a code file and convert it into a decoded speech file formed by concatenating representative
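The encoder's Processing step in this excerpt is a nearest-neighbour search: each frame's MFCC vector is replaced by the code of the closest representative under Euclidean distance. A minimal sketch of that search is given below. The names `Feature`, `nearest_code` and `encode` are hypothetical and do not reproduce the report's compare function. The codebook is assumed to be held in memory with the cluster number as the index.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

using Feature = std::vector<double>;  // one MFCC vector (MAX_DIM values in the report)

// Returns the index (code) of the codebook entry nearest to mfcc.
// Squared distances are compared, which gives the same winner as
// comparing Euclidean distances. Ties go to the lowest code.
unsigned int nearest_code(const std::vector<Feature>& codebook, const Feature& mfcc) {
    assert(!codebook.empty());
    unsigned int best_code = 0;
    double best_dist = std::numeric_limits<double>::max();
    for (std::size_t code = 0; code < codebook.size(); ++code) {
        assert(codebook[code].size() == mfcc.size());
        double dist = 0.0;
        for (std::size_t d = 0; d < mfcc.size(); ++d) {
            const double diff = codebook[code][d] - mfcc[d];
            dist += diff * diff;
        }
        if (dist < best_dist) {
            best_dist = dist;
            best_code = static_cast<unsigned int>(code);
        }
    }
    return best_code;
}

// Encodes a message as the sequence of codes of its frames.
std::vector<unsigned int> encode(const std::vector<Feature>& codebook,
                                 const std::vector<Feature>& frames) {
    std::vector<unsigned int> codes;
    codes.reserve(frames.size());
    for (const Feature& f : frames) codes.push_back(nearest_code(codebook, f));
    return codes;
}
```

This exhaustive search costs one distance computation per codebook entry per frame, which is consistent with the long running times the user manual warns about for large repositories.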
41. will have similar characteristics. A significant reduction in storage could result if the actual signal information for both these sounds is not stored. Instead, the elementary sound is stored just once, and wherever this sound appears the same stored sound is played.
E-mail is good only for text and graphics transmission. Standard sound formats that encode human speech produce extremely large outputs that are unsuitable for e-mail communication. However, if certain assumptions are made about the features of human speech, the communication can be made efficient. A speech profile of a person can be created, containing the collection of elementary sounds uttered. This profile will be a one-time download for the listeners. The actual audio messages can then be encoded based on the profile. The users will only need to download the encoded data, which will be much smaller than the actual audio data. This can be decoded using the profile stored earlier by the user, and the audio can be regenerated. As only the binary codes are transferred rather than the speech signals themselves, very high compression can be obtained.

2 PROBLEM STATEMENT
The project involves building a system for exchanging voice messages over mail using very high speech compression. The sender will record his voice message and transform it into the coded compressed file using the encoder module. The coded file is transferred
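The compression claimed in the introduction above can be estimated with a rough, idealized calculation. The sketch below uses parameters stated elsewhere in the report (11025 Hz, mono, 8 bits per sample, 20 ms frames, 10000 clusters). It assumes that each code is packed into the minimum number of bits and that the coded file has no header. Both are assumptions made here, not properties confirmed by the report, whose code file may store each code in a wider integer. The function names are hypothetical.

```cpp
#include <cassert>

// Smallest number of bits able to distinguish num_clusters different codes.
unsigned int bits_per_code(unsigned long num_clusters) {
    assert(num_clusters > 0);
    unsigned int bits = 0;
    while ((1UL << bits) < num_clusters) ++bits;
    return bits;
}

// Bit rate of uncompressed PCM audio.
unsigned long raw_bits_per_second(unsigned long sample_rate_hz, unsigned long channels,
                                  unsigned long bits_per_sample) {
    return sample_rate_hz * channels * bits_per_sample;
}

// Bit rate of the coded stream: one code per frame, ideally bit-packed.
unsigned long coded_bits_per_second(unsigned long frame_ms, unsigned long num_clusters) {
    assert(frame_ms > 0 && frame_ms <= 1000);
    return (1000 / frame_ms) * bits_per_code(num_clusters);
}
```

Under these assumptions the raw stream is 88200 bits per second and the coded stream is 50 codes of 14 bits each, or 700 bits per second, a ratio of 126 to 1. A code file that stores each code in 32 bits would reduce that ratio accordingly.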
