User Manual

1. a multi-word location will not be removed when occurring in isolation.

local_places_unambig.txt
    Directory location: lists
    Description: Towns and cities around the hospital
    Format: 1 location name per line
    Use in code: Each line is hashed and scrubbed directly. Multiple words are scrubbed in the context of the whole location name, i.e., each word in a multi-word location will not be removed when occurring in isolation.

local_places_ambig.txt
    Directory location: lists
    Description: Ambiguous town and city names around the hospital
    Format: 1 location name per line
    Use in code: Each line is hashed and marked as potential PHI. Multiple words are scrubbed in the context of the whole location name, i.e., each word in a multi-word location will not be removed when occurring in isolation.

pid_patientname.txt
    Directory location: lists
    Description: Patient ID (PID), patient first names, patient last names
    Format: Per line: PID firstname1 firstname2 _ lastname1 lastname2. Max 2 words for first names, max 2 words for last names.
    Use in code: For the PID of the processed record, each patient name (first or last) is removed directly.
    Notes: This file contains a list of surrogate patient names that appear in the gold standard corpus. For names exceeding 2 words, modify the existing code segment at the beginning of deid to extend this functionality.

stripped_hospitals.txt
    Directory location: lists
    Description: Unambiguous hospital names
    Format: 1 name per line
    Use in code: Each line is hashed and scrubbed directly. Multiple words are scrubbed in the context of the whole location name, i.e., each word
2. in a multi-word location will not be removed when occurring in isolation.

us_area_code.txt
    Directory location: lists
    Description: US area codes
    Format: 1 area code per line
    Use in code: Used to validate if certain numeric patterns are phone numbers

us_States.txt, us_states_abbre.txt, more_us_state_abbreviations.txt
    Directory location: lists
    Description: US state names
    Format: 1 name per line
    Use in code: Used to check for zipcodes and potential locations

commonest_words.txt
    Directory location: dict
    Description: Words that are very common in medical records
    Format: 1 word per line
    Use in code: Used in multiple places to check whether possible names are commonest words

common_words.txt
    Directory location: dict
    Description: Words that are common in medical records
    Format: 1 word per line
    Use in code: Used in multiple places to check whether possible names are common words

sno_edited.txt
    Directory location: dict
    Description: Medical terms that generally should not be removed
    Format: 1 word per line
    Use in code: Used in multiple places to check whether possible names are medical terms

medical_phrases.txt
    Directory location: dict
    Description: Multi-word medical terms that should not be removed
    Format: Multiple words per line
    Use in code: Used to check whether possible names are part of a medical phrase

notes_common.txt
    Directory location: dict
    Description: Really common words or medical terms observed in notes that should not be removed
    Format: 1 word per line
    Use in code: Used to check for potential names and locations

shift.txt
    Directory location: Top level direct
    Description: PID to date offset mapping
    Format: PID Number of days forward
    Use in code: Used in re-identifying of dates
3. names
PO BOX: PO Box number
State Zipcode: Zipcode preceded by state names
State: U.S. state names
Country: Country name
Company: Company name

Telephone

The phone filter generates the following two types of PHI category tags:

Telephone Fax: Telephone or fax numbers
Pager number: Pager or beeper numbers

Other

Social Security Number: Social security numbers
Medical Record Number: Number associated with the medical record
Unit Number: Unique patient number
Age over 90: Age equal to 90 or older
E mail address: Email address
URL: Web URL address
Holiday: Holiday such as Christmas, Hanukah, Ramadan
Ethnicity: Words that indicate ethnicity or nationality, such as American, African, Spanish, etc.
4. (countries_unambig.txt, continued)
    Description: names of countries
    Format: line
    Use in code: scrubbed directly

doctor_first_names.txt
    Directory location: lists
    Description: Unambiguous, hospital-specific first names of doctors
    Format: 1 name per line
    Use in code: Each line is hashed and scrubbed directly
    Notes: This file contains a list of unambiguous surrogate doctor names in the gold standard corpus.

doctor_last_names.txt
    Directory location: lists
    Description: Unambiguous, hospital-specific last names of doctors
    Format: 1 name per line
    Use in code: Each line is hashed and scrubbed directly
    Notes: This file contains a list of unambiguous surrogate doctor names in the gold standard corpus.

ethnicities_unambig.txt
    Directory location: lists
    Description: Unambiguous names of ethnicities
    Format: 1 name per line
    Use in code: Each line is hashed and scrubbed directly

female_names_unambig.txt
    Directory location: lists
    Description: Unambiguous female first names
    Format: 1 name per line
    Use in code: Each line is hashed and scrubbed directly. Also used in the name filter to determine if a word is a first name.

male_names_unambig.txt
    Directory location: lists
    Description: Unambiguous male first names
    Format: 1 name per line
    Use in code: Each line is hashed and scrubbed directly. Also used in the name filter to determine if a word is a first name.

last_names_unambig.txt
    Directory location: lists
    Description: Unambiguous last names
    Format: 1 name per line
    Use in code: Each line is hashed and scrubbed directly

female_names_ambig.txt
    Directory location: lists
    Description: Ambiguous female first names
    Format: 1 name per line
    Use in code: Each line is hashed and marked as potential PHI. Also used in the name filter to
5. De-Identification of Free-Text Medical Records

User Manual, version 1.1

Written by Ishna Neamatullah, September 5, 2006
Modified by Li-wei Lehman, May 15, 2009

Harvard-MIT Division of Health Sciences and Technology
Massachusetts Institute of Technology
Cambridge, MA 02139

Introduction

About this Document

This is the user's manual for the de-identification software developed at the Harvard-MIT Division of Health Sciences and Technology. It describes the dictionaries used by the software, the file formats used for the various input and output files, the PHI tag types generated, and the top-level API of the software.

This document does not attempt to provide an exhaustive description of the software's purpose, structure, or inner workings. For these details, consult the references listed in the bibliography. Consult the README.txt file for a summary of the software's installation and execution instructions.

About the De-Identification Software

The de-identification software is the product of a study at the Harvard-MIT Division of Health Sciences and Technology (HST) to automatically remove confidential patient information from free-text medical records used in intensive care units (ICUs). Patient records are a vital resource in medical research. Before such records can be made available for research studies, protected health information (PHI) must be thoroughly scrubbed according to HIPAA specifications to preserve patient confidentiality. Manual
6. OF_RECORD.

It is assumed that each patient has a unique patient ID and that each note has a unique record number for the patient.

Note that in the current release of the gold standard corpus the record date is not supplied. A default date is used in the Perl code for date shifting when the record date is not supplied in the header. If you would like the deid code to shift the dates within the medical records properly, you need to supply a record date for each record in the header, as follows:

    START_OF_RECORD <Patient ID> <Record Number> <Record Date>

The <Record Date> should be in the format <MM DD YYYY>.

Format for id.deid

The PHI location file id.deid, which contains all gold standard PHI locations, does not need to be passed into the code as an argument. When the user requests performance statistics, the algorithm assumes that a PHI locations file called <filename>.deid exists in the directory. The format of this .deid file is as follows:

    Patient <Patient ID><TAB>Note <Record Number>
    <PHI Start><TAB><PHI Start><TAB><PHI End>
    Patient <Patient ID><TAB>Note <Record Number>
    <PHI Start><TAB><PHI Start><TAB><PHI End>

An example follows for notes 1 and 2 of a patient with ID 1100:

    Patient 1100    Note 1
    12      12      15
    24      24      29
    Patient 1100    Note 2
    10      10      18
    245     245     251
    310     310     312

Note: The first <PHI Star
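To make the id.deid layout concrete, here is a minimal Python sketch that reads the header lines and the location triples into a dictionary. It is an illustration only and is not part of the Perl distribution; the function name parse_deid_locations and the shape of the returned dictionary are assumptions made for this example.

```python
from collections import defaultdict


def parse_deid_locations(text):
    """Parse .deid content into {(patient_id, note): [(start, start, end), ...]}.

    Hypothetical helper: fields are split on any whitespace, so both
    tab-separated and space-separated files are accepted.
    """
    locations = defaultdict(list)
    current = None
    for raw in text.splitlines():
        fields = raw.split()
        if not fields:
            continue  # skip blank lines
        if fields[0].lower() == "patient":
            # Header line: Patient <Patient ID><TAB>Note <Record Number>
            current = (fields[1], fields[3])
            locations[current]  # register the note even if it has no PHI
        else:
            if current is None:
                raise ValueError("location line found before any Patient header")
            word_start, phi_start, phi_end = (int(f) for f in fields)
            locations[current].append((word_start, phi_start, phi_end))
    return dict(locations)


# The example from the manual: notes 1 and 2 of patient 1100.
example = (
    "Patient 1100\tNote 1\n12\t12\t15\n24\t24\t29\n"
    "Patient 1100\tNote 2\n10\t10\t18\n245\t245\t251\n310\t310\t312\n"
)
parsed = parse_deid_locations(example)
print(parsed[("1100", "2")])
```

Keying the result on the (patient ID, record number) pair mirrors the manual's assumption that a record number is unique only within one patient.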
7. cation. Many of these lists and dictionaries are provided in a package along with the software. The software declares arrays of context words that can be used to identify PHI. The software calls the function setup to create some lookup lists of known PHI in memory for fast comparison with individual words during de-identification of text. It then calls the function deid to de-identify the text.

setup

Arguments: None
Returns: None
Called by: Topmost level of code
Description: Creates some lookup lists to have in memory.
Function synopsis: The function sets up hashes of known PHI lists and dictionaries for direct identification of words in text (e.g., last names, hospital names). The function preloads some PHI dictionaries into corresponding arrays (e.g., locations, states) and generates associations between PHI in some lists and PHI categories.

deid

Arguments: None
Returns: None
Called by: Topmost level of code
Description: This function reads in the text file to be de-identified, calls the subroutine findPHI to de-identify the text paragraph by paragraph, and outputs the de-identified text to a file.
Function synopsis: The function opens the data file that contains the text to be de-identified. It reads in the data file paragraph by paragraph, so that items that extend over lines are not missed. It calls the function findPHI with each paragraph as the argument for the de-identification of the paragraph. The function subsequen
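The two ideas in this part of the API, in-memory lookup structures and paragraph-wise reading, can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the Perl implementation: the helper names load_lookup, split_paragraphs and exact_matches are hypothetical, case-insensitive matching is an assumption, and a Python set stands in for the Perl hashes mentioned above.

```python
import re


def load_lookup(lines):
    """Build an in-memory lookup set from one-entry-per-line list content.

    Lower-casing the entries is an assumption of this sketch.
    """
    return {line.strip().lower() for line in lines if line.strip()}


def split_paragraphs(text):
    """Split on blank lines so an item wrapped across lines stays in one unit."""
    return [p for p in re.split(r"\n\s*\n", text) if p.strip()]


def exact_matches(paragraph, lookup):
    """Return (start, end, word) for every word that appears in the lookup set."""
    hits = []
    for match in re.finditer(r"[A-Za-z']+", paragraph):
        if match.group().lower() in lookup:
            hits.append((match.start(), match.end(), match.group()))
    return hits


last_names = load_lookup(["Smith\n", "\n", "Jones\n"])
note = "Pt seen by Dr\nSmith today.\n\nStable overnight."
for paragraph in split_paragraphs(note):
    print(exact_matches(paragraph, last_names))
```

Because the first paragraph is kept whole, "Dr" and "Smith" stay in the same unit even though a line break separates them, which is the reason the manual gives for reading paragraph by paragraph.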
8. de-identification on large databases tends to be prohibitively expensive, time-consuming, and prone to error, making a computerized algorithm an urgent need for large-scale de-identification purposes. We have developed an automated pattern-matching de-identification algorithm that uses medical and hospital-specific information. The current version of the algorithm has an overall recall (sensitivity) of around 0.967 and a precision (positive predictive value) of 0.748.

Software Installation and Execution

Platforms

Perl 5.8 or 5.10 under Fedora Core 10 (Linux 2.6.27) was used for development and testing. The code is also expected to run on Windows, but it has not been tested on that platform.

Code organization

The source code is contained in a single file, deid.pl. Each de-identification run can be configured using deid.config. Associated dictionaries used for de-identification are in the folders lists and dict. We suggest extending or modifying all other lists and dictionaries to suit your particular needs. A comprehensive description of all lists and dictionaries follows in the next section.

Installation and Execution

Please see the README.txt file.

File Formats

The input to the code needs to be a single text file containing the gold standard corpus, with the extension .text.

Format for id.text

Each record in the corpus starts with the following format:

    START_OF_RECORD <Patient ID> <Record Number>

The record must end with END_
9. determine if a word is a first name.

male_names_ambig.txt
    Directory location: lists
    Description: Ambiguous male first names
    Format: 1 name per line
    Use in code: Each line is hashed and marked as potential PHI. Also used in the name filter to determine if a word is a first name.

last_names_ambig.txt
    Directory location: lists
    Description: Ambiguous last names
    Format: 1 name per line
    Use in code: Each line is hashed and marked as potential PHI. Also used in the name filter to determine if a word is a last name.

female_names_popular.txt, male_names_popular.txt, last_names_popular.txt
    Directory location: lists
    Description: Popular female and male first names, and popular last names
    Format: 1 name per line
    Use in code: Each line is hashed and marked as potential PHI. Also used in the name filter to determine if an ambiguous name is also a popular name.

last_name_prefixes.txt, prefixes_unambig.txt
    Directory location: lists
    Description: Prefixes that may appear before a last name
    Format: 1 prefix per line
    Use in code: Used to identify name prefixes

locations_unambig.txt
    Directory location: lists
    Description: Unambiguous location names
    Format: 1 location name per line
    Use in code: Each line is hashed and scrubbed directly. Multiple words are scrubbed in the context of the whole location name, i.e., each word in a multi-word location will not be removed when occurring in isolation.

locations_ambig.txt
    Directory location: lists
    Description: Ambiguous location names
    Format: 1 location name per line
    Use in code: Each line is hashed and marked as potential PHI. Multiple words are scrubbed in the context of the whole location name, i.e., each word in
10. guration file is used to allow users to enable or disable the following filter types: Name, SSN, URL, Email, Telephone, Unit Number (hospital patient identification number), Age (age over 89), Location, Date, and U.S. State. There are also flags in the configuration file that control whether certain dictionaries are loaded and used by the code for de-identification. The dictionaries that can be enabled or disabled in the configuration file include: the patient identification number (PID) to patient name mapping, the PID to date offset mapping (with a date shift value for each patient), country names, company names, hospital names, location names, doctor names, U.S. city names, U.S. state names, and ethnicity.

An overview of the main functions in the software follows, in order of execution. The main function responsible for de-identification is deid, which calls findPHI on each paragraph of text. findPHI scans through the paragraph of text and identifies PHI. After deid has run each paragraph through findPHI, the software calls outputText to create the de-identified and re-identified output of the paragraph. We list the API of the major functions in the software below to give an idea of its general structure.

TOP-MOST LEVEL OF CODE

Returns: None
Called by: Command at command prompt
Function synopsis: The de-identification software initially sets the paths of the lists and dictionaries in the working directory that will be used in de-identifi
11. ification process, for debugging purposes. The format is:

    <PHI Start> <PHI End> <PHI_Text> <PHI_TYPE>

The <PHI_Text> is the string from the text that corresponds to the PHI start and end locations given by the first two numbers. If it is preceded by [...], it is NOT a PHI: the string was considered by the deid code as a potential PHI, but the algorithm ultimately decided that it is not a PHI. If it is not preceded by [...], then it is considered a PHI. In this case, the <PHI_Text> will be followed by the <PHI_TYPE> string.

Dictionaries and Lists

The code uses multiple lists of known PHI and dictionaries of words and medical terms. We present the specific format of each list/dictionary in the following table. Each entry gives the name of the list or dictionary, followed by its directory location, description, format, use in code, and notes.

Multi-word names are allowed in the dictionary unless specified otherwise. In the case of multi-word names, the code will scan for the pattern in the note; all words in the name must appear in the order listed in the dictionary for there to be a match.

company_names_unambig.txt
    Directory location: lists
    Description: Unambiguous names of companies
    Format: 1 name per line
    Use in code: Each line is hashed and scrubbed directly

company_names_ambig.txt
    Directory location: lists
    Description: Ambiguous names of companies
    Format: 1 name per line
    Use in code: Each line is hashed and marked as potential PHI

countries_unambig.txt
    Directory location: lists
    Description: Unambiguous
    Format: 1 name per
    Use in code: Each line is hashed and
12. last name in the dictionary
Doctor First Name: Doctor first name
Doctor Last Name: Doctor last name
Female First Name (un): Unambiguous female first name
Male First Name (un): Unambiguous male first name
Name (LF): Last name followed by a comma and then a first name
Name (MD): Doctor names followed by MD
Name (PRE): Doctor name or initial preceded by words such as physician, PCP, provider, etc.
Name (NI): Names preceded by name indicators such as mother, brother, husband
Name (NameIs): Name preceded by the term "name is"
Name Prefix (Prefixes): Name prefixes such as de la or van der
Last Name (Prefixes): Name preceded by prefixes such as de la or van der
Name (STitle): Name followed by specific titles such as DR, MR, or MS
Name (PTitle): Name followed by plural titles such as Drs. and Professors
Name (NamePattern): Various name patterns that involve a first name followed by an optional 1 or 2 middle initial(s) and then a last name

Location

PHI category tags generated by the location filters include the following:

Street Address: Street address
Location: Location in general, such as town or city names
Location (Universities): University names
Hospital: Hospital names
Wardname: Hospital ward
13. riginal data file with all the identified PHI replaced with PHI tags or shifted dates.

stat

Arguments: filenames of the file containing Gold Standard PHI locations and of the file containing PHI locations of the current de-identification run
Returns: None
Called by: Topmost level of code
Description: Calculates code performance statistics if comparison mode is set to 1 and if the Gold Standard is available.
Function synopsis: The function compares the PHI locations contained in the gold standard database and in the output from the de-identification software, determines the recall and precision of the de-identification results, and prints them on the screen.

APPENDIX B: PHI Tag Types

The de-identification algorithm replaces each PHI found in the medical notes with a PHI category tag. In this section, we list the PHI tags defined in the code.

Name

The name filter replaces each name instance found in the medical notes with a PHI tag that indicates the type of name replaced (e.g., first, last, female, male). In some cases, the pattern used to detect the name is specified in parentheses following the name type. For example, the tag Name (PTitle) indicates that the name matches patterns defined by plural titles such as Drs. and Professors. Example name PHI tags are as follows:

Known patient firstname: Name matched the patient's first name listed in the dictionary
Known patient lastname: Name matched the patient's
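The comparison that stat is described as performing, gold standard locations against the locations found in the current run, reduces to two ratios. The Python sketch below is an illustration only: it assumes that a predicted PHI counts as correct when its (start, end) pair matches a gold standard pair exactly, which may differ from how the Perl code scores partial overlaps, and the function name precision_recall is hypothetical.

```python
def precision_recall(gold, predicted):
    """Return (precision, recall) for two collections of (start, end) spans.

    Assumption of this sketch: only exact span matches are true positives.
    """
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    # Precision (positive predictive value): share of reported PHI that is real.
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall (sensitivity): share of real PHI that was reported.
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


gold_spans = [(12, 15), (24, 29), (40, 44)]
run_spans = [(12, 15), (24, 29), (50, 53), (60, 62)]
precision, recall = precision_recall(gold_spans, run_spans)
print(f"precision={precision:.3f} recall={recall:.3f}")
```

In this toy case two of the four reported spans are correct and two of the three gold standard spans are found, so precision is 0.5 and recall is about 0.667. For de-identification, recall is usually the more critical figure, since a missed PHI is a confidentiality leak while a false positive only removes harmless text.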
14. shift ory

Bibliography

• Neamatullah I, Douglass M, Lehman LH, Reisner A, Villarroel M, Long WJ, Szolovits P, Moody GB, Mark RG, and Clifford GD. Automated de-identification of free-text medical records. BMC Med Inform Decis Mak 2008, 8:32. URL: http://www.biomedcentral.com/1472-6947/8/32
• Neamatullah I. Automated De-Identification of Free-Text Medical Records. MEng Thesis, Massachusetts Institute of Technology, Cambridge, MA, 2006.
• Douglass M. Computer-Assisted De-identification of Free-text Nursing Notes. MEng Thesis, Massachusetts Institute of Technology, Cambridge, MA, USA, 2005.
• Douglass M, Clifford GD, Reisner A, Long WJ, Moody GB, Mark RG. De-identification algorithm for free-text nursing notes. Computers in Cardiology 6 2, 2005.
• Douglass M, Clifford GD, Reisner A, Moody GB, Mark RG. Computer-assisted de-identification of free text in the MIMIC II database. Computers in Cardiology M6 2, 2004.

APPENDIX A: De-Identification Code Description

In this Appendix, we document the major Perl subroutines in our de-identification software. The code was implemented in Perl version 5.8.8, upgraded to version 5.10, and tested under Fedora Core 10 (Linux 2.6.27).

De-identification involves scanning the entire text to identify PHI, classifying each item of PHI based on the PHI categories, and replacing it with a PHI category tag (see Appendix B for the PHI category tags used by the software). An input confi
15. t> is the character index of the beginning of the word with the PHI. The second <PHI Start> is the index of the beginning of the PHI selection. In the Gold Standard corpus, they are the same number. The third number, <PHI End>, is the index of the last character selected as PHI.

Format for id.phi

This file has the same format as id.deid.

Format for id.types

The PHI type category file id.types contains the category of each PHI that appears in the gold standard corpus. PHIs are classified into the following categories: PTName, PTNameInitial, HCPName, RelativeProxyName, Location, Date, DateYear, Phone, etc. See Appendix B for a complete listing of the PHI types. Some of the common PHI types are described as follows:

• PTName: Patient names (first, middle, or last names)
• PTNameInitial: Patient name initials
• RelativeProxyName: Names of the patient's family members or proxies
• HCPName: Health Care Professional names (doctors, nurses, hospital workers, etc.)
• Location: Locations, which include hospitals, company names, and street addresses
• Date: Dates with day, month, and year
• DateYear: Stand-alone year without day or month
• Phone: Telephone, pager, and fax numbers

The format of the PHI category file is as follows:

    <Patient ID> <Record Number> <PHI Start> <PHI End> <Type>

where <Type> is one of the PHI categories.

Format for id.info

This file contains information on PHI locations and de-ident
16. tly obtains and stores the PHI locations in the paragraph identified by findPHI. The function finally calls the function outputText with hashes of identified PHI locations to obtain the de-identified text.

findPHI

Arguments: paragraph of text
Returns: hash of PHI found
Called by: deid
Description: Dispatched from the deid function, which performs de-identification. Reads in a paragraph of text and runs the de-identification filters on it.
Function synopsis: The function splits the data text into items demarcated by spaces. It performs an exact matching of each item against lists of known PHI (e.g., proper names). The function then calls each filter function (e.g., the name, age, and date filters) sequentially. The function returns a hash of approved PHI.

outputText

Arguments: hash of PHI locations
Returns: None
Called by: deid
Description: Creates the de-identified version of the text. Replaces dates with shifted dates and other PHI with their PHI types.
Function synopsis: The function prints all the identified PHI locations to the output file. If a PHI is a date, the function shifts the date and replaces it in the de-identified text. This shift may be a predetermined value or a random value. If a PHI is not a date, the function replaces it in the de-identified text with a tag of the PHI type. The function then prints the remaining non-PHI text to the de-identified text. Thus, the function outputs a de-identified text file that is the o
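The replacement step described for outputText can be illustrated with a short Python sketch. Several details here are assumptions rather than facts about the Perl code: the function name output_text, the bracketed tag layout, the MM/DD/YYYY date pattern, and the use of exclusive end offsets (the manual's <PHI End> is the index of the last character, so it would need one added to it).

```python
from datetime import datetime, timedelta


def output_text(text, phi_spans, shift_days):
    """Replace PHI spans with type tags, or with shifted dates for type "Date".

    phi_spans: non-overlapping (start, end, phi_type) tuples, end exclusive.
    The "[type]" tag layout and the date format are assumptions of this sketch.
    """
    pieces = []
    cursor = 0
    for start, end, phi_type in sorted(phi_spans):
        pieces.append(text[cursor:start])  # copy non-PHI text unchanged
        phi_text = text[start:end]
        if phi_type == "Date":
            original = datetime.strptime(phi_text, "%m/%d/%Y")
            shifted = original + timedelta(days=shift_days)
            pieces.append(shifted.strftime("%m/%d/%Y"))
        else:
            pieces.append("[" + phi_type + "]")
        cursor = end
    pieces.append(text[cursor:])  # trailing non-PHI text
    return "".join(pieces)


note = "Seen by Smith on 01/30/2009."
spans = [(8, 13, "Doctor Last Name"), (17, 27, "Date")]
print(output_text(note, spans, shift_days=5))
```

Applying one fixed offset to every date in a patient's records keeps the intervals between events intact, which is presumably why the manual pairs each patient ID with a single shift value.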
