Mining sequence data in R with the TraMineR package

1. [Table of contents, continued]
10.5.1 Looking after specific subsequences
10.5.2 Counting the number of occurrences in each event sequence
10.5.3 Selecting event subsequences
10.5.4 Duration of event sequences
A Installing and using R
A.1 Obtaining and installing R
A.2 [title illegible in source]
A.3 Data manipulation
A.3.1 Creating and printing objects
A.3.2 Vectors
A.3.3 Data frames, matrices and lists
A.3.4 Accessing and extracting data
A.4 R libraries
A.5 Some other useful functions
A.5.1 The apply function
A.5.2 The table function
A.6 Creating and saving graphics
A.7 Performance and memory usage
B Installing TraMineR
B.1 Installing from binary package
B.1.1 Windows
B.1.2 [title illegible in source]
B.2 Installing from source package
B.2.1 Windows
B.2.2 Linux
C Information about T
2. this matrix are not associated with the transition from a state to itself but are just the starting event of the sequence. If we omit this step, information about the beginning of the event sequence will be lost. In our case we insert, for example, the event FullTime into each event sequence that begins with the state A. To generate our own matrix, we first use seqetm() to assign the correct column and row names and then enter the content of our own matrix.

R> transition <- seqetm(actcal.seq, method = "transition")
R> transition[1, 1:4] <- c("FullTime", "Decrease,PartTime", "Decrease,LowPartTime", "Stop")
R> transition[2, 1:4] <- c("Increase,FullTime", "PartTime", "Decrease,LowPartTime", "Stop")
R> transition[3, 1:4] <- c("Increase,FullTime", "Increase,PartTime", "LowPartTime", "Stop")
R> transition[4, 1:4] <- c("Start,FullTime", "Start,PartTime", "Start,LowPartTime", "NoActivity")
R> transition
  A                   B                    C                      D
A "FullTime"          "Decrease,PartTime"  "Decrease,LowPartTime" "Stop"
B "Increase,FullTime" "PartTime"           "Decrease,LowPartTime" "Stop"
C "Increase,FullTime" "Increase,PartTime"  "LowPartTime"          "Stop"
D "Start,FullTime"    "Start,PartTime"     "Start,LowPartTime"    "NoActivity"

Once we have our event matrix, we can convert our state sequence data set into the time stamped event (TSE) form by means of seqformat().

R> actcal.tse <- seq
3. 10 with partner (married or not)
11 with friends or in a flat share
12 alone
13 other situation
14 with both natural parents and the partner (married or not)
15 with both natural parents and friends or flat share
16 with partner (married or not) and friends or flat share
17 missing values

R> bvla100_rec <- as.integer(LA$bvla100)
R> table(bvla100_rec)
bvla100_rec
   4    6    7    8    9   10   11   12   13   14   15   16
 110  875   49  148   14 1066   96  393  226   12    2    4
R> LA <- data.frame(LA, bvla100_rec)

We now convert the SPELL data into state sequences. The minimal information needed for importing data in SPELL format is described in Table 5.3. If no option is specified, the input data is assumed to comply with this structure. The user can alternatively specify which columns in the input data set contain the mandatory variables using the id, begin, end and status options, or select the variables in the required order using the var option.

Table 5.3: Structure for the spell format
Position  Variable                        Option name
1         Personal identification number  id
2         Start time                      begin
3         End time                        end
4         Status                          status

Other options pertaining to the time axis definition and the handling of overlaps in the beginning and ending times of the successive spells are also available. In the first example below we import the data with
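The four-column spell structure of Table 5.3 can be illustrated with a small base R sketch. This is not TraMineR's seqdef() implementation: the function name spell_to_sts and the toy data are hypothetical, and the sketch assumes integer begin and end times on a calendar time axis, with positions not covered by any spell left missing.

```r
# Expand spell records (id, begin, end, status) into one state sequence per id.
# Hypothetical helper for illustration only; not part of TraMineR.
spell_to_sts <- function(spells,
                         tmin = min(spells$begin),
                         tmax = max(spells$end)) {
  ids <- unique(spells$id)
  out <- matrix(NA_character_, nrow = length(ids), ncol = tmax - tmin + 1,
                dimnames = list(ids, tmin:tmax))
  for (r in seq_len(nrow(spells))) {
    # columns covered by this spell, shifted so that tmin maps to column 1
    cols <- (spells$begin[r]:spells$end[r]) - tmin + 1
    out[as.character(spells$id[r]), cols] <- as.character(spells$status[r])
  }
  out
}

# Toy data: individual 1 has two spells, individual 2 has one
spells <- data.frame(id = c(1, 1, 2),
                     begin = c(1, 4, 2),
                     end = c(3, 6, 5),
                     status = c("A", "B", "C"),
                     stringsAsFactors = FALSE)
sts <- spell_to_sts(spells)
print(sts)
```

Because the time axis runs from the smallest begin value to the largest end value, the second individual starts and ends with a missing state, much like the calendar time axis behaviour described for process = FALSE.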
4. R> turb.quant <- quantile(biofam$Turbulence, c(0, 0.1, 0.45, 0.55, 0.9, 1))
R> turb.quant
      0%      10%      45%      55%      90%     100%
1.000000 2.000000 4.697325 5.321928 6.915230 8.807355

and creating a categorical variable using the percentile values

R> turb.group <- cut(biofam$Turbulence, turb.quant, labels = c("Min", "q10-45", "Median", "q55-90", "Max"), include.lowest = T)
R> table(turb.group)
turb.group
   Min q10-45 Median q55-90    Max
   223    684    369    542    182

and keeping only the first, third and fifth levels of the variable

R> turb.group <- factor(turb.group, levels = c("Min", "Median", "Max"))
R> table(turb.group)
turb.group
   Min Median    Max
   223    369    182

and next by plotting the frequency plots for each of the three intervals. The plot is shown in Figure 8.8.

R> seqfplot(biofam.seq, group = turb.group, pbarw = TRUE)

[Figure 8.8: Low, median and high sequence turbulences, biofam data set. Panels: Min, Median, Max; x-axis from a15 to a30; legend: Parent, Left, Married, Left+Marr, Child, Left+Child, Left+Marr+Child, Divorced]

8.5.2 Weighted entropy
Chapter
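The quantile-then-cut grouping used above can be tried without the biofam data. The sketch below uses simulated values in place of biofam$Turbulence (an assumption made only so that the code runs on its own); the break points and labels follow the text.

```r
# Simulated stand-in for biofam$Turbulence (assumption: any numeric vector works)
set.seed(1)
turb <- runif(200, min = 1, max = 9)

# Break points at the 0, 10, 45, 55, 90 and 100 percent quantiles
turb.quant <- quantile(turb, c(0, 0.1, 0.45, 0.55, 0.9, 1))

# Five intervals; include.lowest = TRUE keeps the minimum in the first one
turb.group <- cut(turb, turb.quant,
                  labels = c("Min", "q10-45", "Median", "q55-90", "Max"),
                  include.lowest = TRUE)

# Keeping only three levels turns all other observations into NA
turb.group3 <- factor(turb.group, levels = c("Min", "Median", "Max"))
table(turb.group3, useNA = "ifany")
```

Observations in the two dropped intervals become NA in the new factor, which is why a plot grouped by it shows only the three retained intervals.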
5. <- is used to assign a value to an R object, and entering solely the name of the object prints its value on the output screen. In the next example, we first create (or replace) the object x by assigning it the value 2 and then display its content.

> x <- 2
> x
[1] 2

When printing the x object, the output contains [1] in front of the values of x, indicating that the line begins with the first element of the object. In this case it is of little interest because x has only one element. It may be useful, however, for objects containing more than one element, such as vectors, matrices or data frames, which we describe hereafter.

A.3.2 Vectors

In R, vectors are very important. Even objects containing one single value are vectors.

> z <- 4
> is.vector(z)
[1] TRUE

Creating vectors with c()

The widely used c() (combine) function combines its arguments into a vector. In the following example, we use this function to create a vector with the previously created x and z objects.

> c(x, z)
[1] 2 4

Filling vectors with number sequences

It is often useful to generate a vector of consecutive numbers. This is easily done by using the sequence generating operator ':' as shown in the following example.

> seq <- 1:50
> seq
 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
[26] 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45
6. maxGap: The maximum time gap between two transitions.
windowSize: The maximum window size, that is, the maximum time taken by a subsequence.
ageMin: Minimum age at beginning of subsequences.
ageMax: Maximum age at beginning of subsequences.
ageMaxEnd: Maximum age at end of subsequences.

[Figure 10.1: Frequencies of 15 most frequent event subsequences]

Each of these parameters is ignored when set equal to -1, which is their default value. The following examples show how to set time constraints. First, we search for subsequences enclosed in a five year interval with no more than two years between two transitions.

R> time.constraint <- seqeconstraint(maxGap = 2, windowSize = 5)
R> fsubseq <- seqefsub(bf.seqe, pMinSupport = 0.01, constraint = time.constraint)
R> fsubseq[1:4]
Subsequence                    Support  Count
(Parent)                         0.986   1972
(Parent>Left)                    0.434    868
(Left+Marr>Left+Marr+Child)      0.286
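To make the meaning of the two constraints concrete, here is a minimal base R sketch that checks whether one occurrence of a subsequence, given by the time stamps of its transitions, respects a maximum gap and a maximum window size. The function meets_constraints is hypothetical and only mirrors the definitions given above; it is not how TraMineR searches for subsequences. As in the text, -1 means that a constraint is ignored.

```r
# times: increasing time stamps of the transitions of one subsequence occurrence
# max_gap, window_size: -1 disables the corresponding constraint
meets_constraints <- function(times, max_gap = -1, window_size = -1) {
  gaps <- diff(times)
  gap_ok <- max_gap == -1 || all(gaps <= max_gap)
  window_ok <- window_size == -1 ||
    (max(times) - min(times)) <= window_size
  gap_ok && window_ok
}

meets_constraints(c(20, 21, 23), max_gap = 2, window_size = 5)      # within both limits
meets_constraints(c(20, 23), max_gap = 2, window_size = 5)          # gap of 3 years
meets_constraints(c(20, 22, 24, 26), max_gap = 2, window_size = 5)  # spans 6 years
```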
7. [Figure 2.6: A short example. Most discriminating transitions between clusters, mvad data. Bars show Pearson residuals by cluster type.]

Chapter 3 The TraMineR package

TraMineR is an add-on package to R providing a set of functions for describing, visualizing and analysing sequence data, together with example data sets. The latter are used in this manual to demonstrate the features offered by the package. TraMineR can be installed either from a precompiled binary package or from source files. The latest versions for Linux (32 and 64 bits), Mac OS X and Windows are available at http://mephisto.unige.ch/pub/traminer or directly from the CRAN (http://cran.r-project.org). For more detail on how to install TraMineR, see Appendix B, p. 122. This chapter describes the basic use of TraMineR and presents the included data sets that will be used in this manual to demonstrate the package capabilities.

3.1 Loading, using and getting help

Loading

Once you have installed TraMineR on your system, you have to load it to access its functionalities. This is done by means of the library command. Typing

R> library(TraMineR)

gives you access to the functions and dat
8. As claimed above, the sum of the transition rates from one state to all other states (including the transition rate between the state and itself) should equal 1. But we don't trust anybody and we want to check it. We therefore apply the rowSums function, which returns the sum of the rows, to the tr object containing the transition rates.

R> rowSums(tr)
[A ->] [B ->] [C ->] [D ->]
     1      1      1      1

Of course, there is a shorter way that leads to the same result.

R> rowSums(seqtrate(actcal.seq))
[A ->] [B ->] [C ->] [D ->]
     1      1      1      1

7.2.5 Mean time spent in each state

We may be interested in the mean time spent in each state. TraMineR provides a special function called seqmtplot to visualize the mean time values. In the next example, we use this function together with the group option to visualize the mean times for each sex separately (Figure 7.8). As for the other plotting functions, the colors for representing the states are automatically retrieved from the sequence object.

R> seqmtplot(actcal.seq, group = actcal$sex, title = "Mean time")

[Figure 7.8: Mean time spent in each state, actcal data. Panels: Mean time - man, Mean time - woman; legend: > 37 hours, 19-36 hours, 1-18 hours, no work]

7.3 Describing and visu
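The row-sum check above can be reproduced on a toy example in base R. The sketch below estimates transition rates by simple counting of adjacent pairs; it illustrates the idea and is not the seqtrate() implementation. The three toy sequences are made up, and every state occurs at least once as an origin so that no row is empty.

```r
# Three toy sequences of length 4 over the alphabet A, B, C
seqs <- rbind(c("A", "A", "B", "B"),
              c("B", "C", "C", "A"),
              c("C", "A", "A", "A"))
states <- c("A", "B", "C")

# Pair each position (origin) with the next one (destination)
from <- factor(seqs[, -ncol(seqs)], levels = states)
to <- factor(seqs[, -1], levels = states)

counts <- table(from, to)
rates <- prop.table(counts, margin = 1)  # divide each row by its total

print(round(rates, 2))
rowSums(rates)  # each row of rates sums to 1 by construction
```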
9. Bibliography

Aassve, A., F. Billari, and R. Piccarreta (2007). Strings of adulthood: A sequence analysis of young British women's work-family trajectories. European Journal of Population 23(3), 369-388.
Abbott, A. (2001). Time Matters: On Theory and Methods. Chicago: University of Chicago Press.
Abbott, A. and J. Forrest (1986). Optimal matching methods for historical sequences. Journal of Interdisciplinary History 16, 471-494.
Agrawal, R. and R. Srikant (1995). Mining sequential patterns. In P. S. Yu and A. L. P. Chen (Eds.), Proceedings of the International Conference on Data Engineering (ICDE), Taipei, Taiwan, pp. 487-499. IEEE Computer Society.
Billari, F. C. (2001). The analysis of early life courses: complex descriptions of the transition to adulthood. Journal of Population Research 18(2), 119-24.
Brzinsky-Fay, C., U. Kohler, and M. Luniak (2006). Sequence analysis with Stata. The Stata Journal 6(4), 435-460.
Elzinga, C. and A. Liefbroer (2007). De-standardization of family-life trajectories of young adults: A cross-national comparison using sequence analysis. European Journal of Population / Revue européenne de Démographie 23(3), 225-250.
Elzinga, C. H. (2006). Turbulence in categorical time series. Mathematical Population Studies (submitted).
Elzinga, C. H. (2007). CHESA 2.1 User Manual. Amsterdam: Vrije Universiteit.
Elzinga, C. H. (2008). Sequence analysis: Metric representations o
10.
          1         2         3         4 5
1 1.0000000 0.8164966 0.7071068 0.6324555 0
2 0.8164966 1.0000000 0.8660254 0.7745967 0
3 0.7071068 0.8660254 1.0000000 0.8944272 0
4 0.6324555 0.7745967 0.8944272 1.0000000 0
5 0.0000000 0.0000000 0.0000000 0.0000000

One can check that these values are equal to those in the upper triangle of Table 4 in Elzinga (2008).

9.3 Longest Common Subsequence (LCS) distances

The Longest Common Subsequence (LCS) based distance is another one of the metrics considered by Elzinga (2008) that is available through the seqdist() function. The notion of subsequence is described in section 8.3.1.

9.3.1 LCS based metric

Let S(x, y) be the nonempty set of common subsequences of sequences x and y. The proposed LCS metric is based on the length of the longest element of S. Let us take the example in Elzinga (2008), consisting of 3 family formation histories, with the meaning of the states being the same as in the famform data set (see subsection 3.2.4).

R> LCS.ex
  Sequence
1 S-U-S-M-S-U
2 U-S-SC-MC
3 S-U-M-S-SC-UC-MC

For convenience, we derive from LCS.ex three distinct sequence objects x, y and z, each containing one sequence.

R> x <- LCS.ex[1, ]
R> y <- LCS.ex[2, ]
R> z <- LCS.ex[3, ]

The length of the longest common subsequence of the first pair of histories (x, y) can be computed with the seqLLCS() function.

R> seqLLCS(x, y)
[1] 2

The l
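The length of a longest common subsequence can be computed with the classic dynamic programming recurrence. The base R sketch below is a generic textbook version written for this illustration; the function name llcs is hypothetical and the code makes no claim about how seqLLCS() is implemented. It is applied to the three family formation histories shown above.

```r
# Length of the longest common subsequence of two state vectors
llcs <- function(a, b) {
  m <- length(a)
  n <- length(b)
  # tab[i + 1, j + 1] holds the LCS length of a[1..i] and b[1..j]
  tab <- matrix(0L, nrow = m + 1, ncol = n + 1)
  for (i in seq_len(m)) {
    for (j in seq_len(n)) {
      if (a[i] == b[j]) {
        tab[i + 1, j + 1] <- tab[i, j] + 1L
      } else {
        tab[i + 1, j + 1] <- max(tab[i, j + 1], tab[i + 1, j])
      }
    }
  }
  tab[m + 1, n + 1]
}

x <- c("S", "U", "S", "M", "S", "U")
y <- c("U", "S", "SC", "MC")
z <- c("S", "U", "M", "S", "SC", "UC", "MC")

llcs(x, y)  # 2, the value reported by seqLLCS(x, y) in the text

# LCS-based distance: |x| + |y| - 2 * LLCS(x, y)
length(x) + length(y) - 2 * llcs(x, y)
```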
11. 7,998,000, whereas it is 10000 × 9999 / 2 = 49,995,000 for 10000 sequences. With R, the size of the data you can handle is limited by the available memory size on your system (at least on Linux systems). Remember that from the moment you compute a distance matrix, the requested memory size increases dramatically. TraMineR has been successful in computing the distance matrix for as many as 30328 sequences, but the size of the half distance matrix was 6.85GB. To give a more common example, computing optimal matching distances for the 4318 sequences of length 16 (841 distinct sequences) of the original data set from which biofam was extracted takes less than 15 seconds on a dual core processor. (Footnote 3: To reduce the number of distances to compute, TraMineR first selects the set of unique sequences.) The resulting 4318 x 4318 distance matrix has a size of 142Mb. Table 3.6 gives computation time and memory usage for some typical examples. The reported computation times concern version 1.0 of TraMineR. Improvements in version 1.1 reduced the indicated times by a factor of at least 10 for large data sets. If you get a message complaining about a lack of memory, you should try gc() to free memory from garbage that may be produced by some memory consuming functions. The computation of distances between sequences was faster with versions 2.6 and 2.7 of R compared with version 2.5.

Table 3.6: Perform
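The figures quoted above follow from simple arithmetic, which the following base R sketch reproduces. It assumes that each distance is stored as an 8-byte double, the standard size of an R numeric; the helper names are made up for this illustration.

```r
# Number of distinct pairs among n sequences
n_pairs <- function(n) n * (n - 1) / 2

# Size in mebibytes of a full n x n matrix of doubles (8 bytes per cell)
full_matrix_mib <- function(n, bytes_per_cell = 8) {
  n^2 * bytes_per_cell / 1024^2
}

n_pairs(4000)                  # 7,998,000 pairwise distances
n_pairs(10000)                 # 49,995,000 pairwise distances
round(full_matrix_mib(4318))   # about 142, as stated for the 4318 x 4318 matrix
```

The number of pairs grows quadratically with the number of sequences, which is why reducing the data to the set of unique sequences before computing distances pays off.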
12. [Figure 2.3: A short example. State distribution within each cluster, mvad data. Legend: employment, further education, higher education, joblessness, school, training; x-axis from Sep.93 to Jan.99]

[Figure 2.4: A short example. Sequence frequencies whit
13. data and to D. McVicar and M. Anyadike-Danes for the permission regarding the mvad data set they used in an article of the Journal of the Royal Statistical Society. Those data sets are included in the TraMineR package and are used for illustrating this user's guide.

Reporting bugs

We have carefully tested the package. Nevertheless, we cannot exclude that programming errors remain, and we encourage you to report any bugs you may encounter to the package maintainer, who is presently alexis.gabadinho@unige.ch. You will thus contribute to improving the package.

Referencing TraMineR

Thank you for citing this User's guide, i.e. Gabadinho, A., G. Ritschard, M. Studer and N. S. Müller, Mining sequence data in R with the TraMineR package: A user's guide, University of Geneva, 2008 (http://mephisto.unige.ch/traminer), when presenting analyses realized with the help of TraMineR.

Contents

1 Introduction
1.1 Aims and features of the TraMineR package
2 A short example to begin with
2.1 State sequence analysis
2.2 Event sequence analysis
3 The TraMineR package
3.1 Loading, using and getting help
3.2 Data sets included in the TraMineR package
3.2.1 The actcal data set
3.2.2 The biofam data set
3.2.3 The mvad data set
3.2.4
14. labels option. The process option, which is passed to the seqformat function, is set to FALSE; that is, the time axis for the sequences is defined as a calendar time axis whose start and end are the minimum and maximum values found in the begin (bvla013) and end (bvla014) columns of the input data set.

R> LA.labels <- seqstatl(LA$bvla100)
R> LA.states <- 1:length(LA.labels)
R> LA.seq <- seqdef(LA, var = c("idpers", "bvla013", "bvla014", "bvla100"), informat = "SPELL", states = LA.states, labels = LA.labels, process = FALSE)

Now we can display the resulting sequences in SPS format. By setting the process option to FALSE, the sequences have been created using a calendar time axis (see 4.1.3 and 5.2.2) ranging from 1914 to 2002. Hence most of the sequences begin with missing states.

R> print(LA.seq[1:15, ], format = "SPS")
   Sequence
1  (*,51)-(4,24)-(10,14)
2  (*,54)-(4,17)-(1,4)-(10,14)
3  (*,47)-(4,17)-(8,5)-(12,1)-(8,1)-(12,14)-(1,3)
4  (*,59)-(4,20)-(10,10)
5  (*,89)
6  (*,11)-(6,28)-(2,50)
7  (*,57)-(4,18)-(1,4)-(10,5)-(1,2)-(10,3)
8  (*,67)-(4,5)-(9,11)-(8,3)-(10,3)
9  (*,51)-(4,22)-(1,4)-(10,5)-(1,7)
10 (*,5)-(4,20)-(10,59)-(7,5)
11 (*,59)-(4,24)-(10,6)
12 (*,61)-(4,22)-(10,6)
13 (*,53)-(4,24)-(10,12)
14 (*,43)-(4,21)-(10,25)
15 (*,45)-(4,17)-(1,9)-(10,18)

If we want the sequences to be defined on a process time axis we need
15. next example shows how to store a histogram of 1000 randomly generated numbers drawn from the normal distribution in the myplot.pdf file.

> pdf(file = "location/myplot.pdf")
> hist(rnorm(1000))
> dev.off()

There are a lot of fine tuning parameters that can be used to set the output page size, font sizes, etc. Check the available options with ?pdf or ?ps. Note that there are similarly png(), jpeg(), tiff() and some other functions for producing graphics in other formats.

A.7 Performance and memory usage

In R, objects are stored in memory. The size and number of objects you can handle is limited by the memory size. If you no longer need an object, you can free memory by deleting it with the command rm(objectname). For example, a sequence data set containing 4318 rows of 16 states needs 0.52 Mb (541kb).

Appendix B Installing TraMineR

B.1 Installing from binary package

Binary versions of TraMineR are the easiest to install. Such binary versions are available for Linux, MacOS X and Windows. Once the TraMineR package is available from the CRAN archive (which may not be the case at the time you read this manual), the most straightforward way will be to install it from the CRAN.

B.1.1 Windows

Installing from the CRAN

Once the TraMineR package is available from the CRAN archive (it may not be the case at the time you read this manual), the easiest way to install it is from the Packages
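A self-contained variant of the graphics example above is sketched below. It writes to R's temporary directory rather than to a fixed location, which is an assumption made only so that the code runs anywhere; the width, height and main arguments are examples of the fine tuning parameters mentioned in the text.

```r
# Write the histogram to a PDF file in the session's temporary directory
out_file <- file.path(tempdir(), "myplot.pdf")

pdf(file = out_file, width = 6, height = 4)  # page size in inches
hist(rnorm(1000), main = "1000 draws from a normal distribution")
invisible(dev.off())                         # close the device to finish the file

file.exists(out_file)
```

Forgetting dev.off() is a common source of empty or unreadable files, since the PDF is only completed when the device is closed.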
16.
1 (000,12)-(0W0,9)-(0WU,5)-(1WU,2)
2 (000,12)-(0W0,14)-(1WU,2)

R> sp.ex1 <- seqdef(sp.ex1, informat = "SPS")

Now sp.ex1 is a sequence object. Its content is displayed below in STS format.

R> print(sp.ex1, ext = TRUE)
    1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17
1 000 000 000 000 000 000 000 000 000 000 000 000 0W0 0W0 0W0 0W0 0W0
2 000 000 000 000 000 000 000 000 000 000 000 000 0W0 0W0 0W0 0W0 0W0
   18  19  20  21  22  23  24  25  26  27  28
1 0W0 0W0 0W0 0W0 0WU 0WU 0WU 0WU 0WU 1WU 1WU
2 0W0 0W0 0W0 0W0 0W0 0W0 0W0 0W0 0W0 1WU 1WU

We use the seqST function to compute the turbulence.

R> seqST(sp.ex1)
  Turbulence
1   6.813988
2   5.292438

Let us now compute the turbulence of the sequences in the biofam data set. As for the entropy, we add a new Turbulence variable with the values of the turbulences to the data frame. Note how this time we pass the output of the seqST function on the fly to the data.frame function.

R> biofam <- data.frame(biofam, seqST(biofam.seq))

To get a first idea of the turbulence distribution, we summarize the created variable with the summary function. The mean turbulence is 4.8, with a minimum of 1 and a maximum of 8.807.

R> summary(biofam$Turbulence)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  1.000   3.691   5.064   4.800   6.222   8.807

We get a histogram for the turbulence of the sequences with the command below, yielding Fig
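The two turbulence values printed above can be reproduced by hand. The base R sketch below assumes the usual form of Elzinga's turbulence, T = log2(phi * (s2max + 1) / (s2 + 1)), where phi is the number of distinct subsequences of the sequence of distinct successive states, s2 is the variance of the spell durations and s2max = (n - 1)(1 - mean duration)^2. Using the population variance (dividing by n) is an assumption chosen because it matches the printed values; the function names are hypothetical and this is not the seqST() code.

```r
# Number of distinct subsequences (including the empty one) of a state vector
count_distinct_subseq <- function(dss) {
  total <- 1          # the empty subsequence
  last <- list()      # total seen just before the previous occurrence of a state
  for (s in dss) {
    prev <- total
    seen <- if (is.null(last[[s]])) 0 else last[[s]]
    total <- 2 * total - seen
    last[[s]] <- prev
  }
  total
}

# Turbulence from the distinct successive states and their durations
turbulence <- function(dss, durations) {
  phi <- count_distinct_subseq(dss)
  n <- length(durations)
  m <- mean(durations)
  s2 <- mean((durations - m)^2)   # population variance (assumption)
  s2max <- (n - 1) * (1 - m)^2
  log2(phi * (s2max + 1) / (s2 + 1))
}

turbulence(c("000", "0W0", "0WU", "1WU"), c(12, 9, 5, 2))
turbulence(c("000", "0W0", "1WU"), c(12, 14, 2))
```

The first sequence visits more distinct states and has more evenly spread durations, so it gets the higher turbulence of the two.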
17.
4   .. 10 12  0  6 22 22 12 22 14
5   16  8  6  6  0 22 22  8 16 10
6   14 30 18 22 22  0 14 20 10 32
7   14 30 18 22 22 14  0 20 14 32
8   14 14  4 12  8 20 20  0 16 18
9    4 22 12 22 16 10 14 16  0 22
10  20  6 16 14 10 32 32 18 22  0

9.3.3 LCS distances with internal gaps

To compute LCS distances between sequences containing gaps (see section 6.5), one can use the with.miss = TRUE option. In that case, missing states are considered as an additional valid state. Let us illustrate this with an example sequence object.

R> ex2.seq
  Sequence
1 A-B-C-D
2 A-B-*-C-D
3 A-B-*-C-D-A

Computing LCS distances with the with.miss = TRUE option yields the following result. (Footnote 3: Recall that you can get normalized distances with the norm = TRUE option.)

R> seqdist(ex2.seq, method = "LCS", with.miss = TRUE)
  1 2 3
1 0 1 2
2 1 0 1
3 2 1 0

According to the formula above, the LCS distance between s2 and s3 is $d_{LCS}(s_2, s_3) = |s_2| + |s_3| - 2 A(s_2, s_3) = 5 + 6 - 2 \cdot 5 = 1$, with $A(s_2, s_3)$, the length of the longest common subsequence of (s2, s3), being 5, i.e. the length of A-B-*-C-D.

9.4 Optimal matching (OM) distances

Optimal matching generates edit distances that are the minimal cost, in terms of insertions, deletions and substitutions, for transforming one sequence into another. This edit distance was first proposed by Levenshtein (1966) and has been popularized in the social sciences by Abbott (Ab
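The edit distance idea behind optimal matching can be sketched with a standard dynamic programming table. The base R function below is a generic textbook version with constant costs; the name om_dist, the default indel cost of 1 and the default substitution cost of 2 are assumptions made for this illustration, not TraMineR defaults. With these costs a substitution is never cheaper than one deletion plus one insertion, so the result coincides with the LCS distance. The three toy sequences treat the gap symbol "*" as an ordinary state.

```r
# Minimal cost of transforming a into b with insertions, deletions, substitutions
om_dist <- function(a, b, indel = 1, sub = 2) {
  m <- length(a)
  n <- length(b)
  d <- matrix(0, nrow = m + 1, ncol = n + 1)
  d[, 1] <- (0:m) * indel   # delete all of a[1..i]
  d[1, ] <- (0:n) * indel   # insert all of b[1..j]
  for (i in seq_len(m)) {
    for (j in seq_len(n)) {
      cost <- if (a[i] == b[j]) 0 else sub
      d[i + 1, j + 1] <- min(d[i, j] + cost,        # match or substitute
                             d[i, j + 1] + indel,   # delete a[i]
                             d[i + 1, j] + indel)   # insert b[j]
    }
  }
  d[m + 1, n + 1]
}

s1 <- c("A", "B", "C", "D")
s2 <- c("A", "B", "*", "C", "D")
s3 <- c("A", "B", "*", "C", "D", "A")

c(om_dist(s1, s2), om_dist(s1, s3), om_dist(s2, s3))
```

Lowering the substitution cost below twice the indel cost makes substitutions attractive and the two distances start to differ.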
18. [Printed data frame, rows 1 to 10; the numeric columns are garbled in the source and the last column reads setosa in every row]

A.4 R libraries

When launching R, you have access to a set of basic functions. You may access additional and more sophisticated functions by explicitly loading add-on packages with the library function. Some of these add-on packages (libraries) may be installed by default on your system, which is for example the case of the foreign library for importing data sets stored in various formats such as Stata, SAS or SPSS. In this case you just have to issue

> library(foreign)

to access the functions provided by the package. In order to use add-on packages like TraMineR that are not installed, you first need to install them on your system. A large number of official and contributed add-on packages are available on the CRAN (http://cran.r-project.org/src/contrib/PACKAGES.html). For installing any of these packages, you can just issue an install.packages command

> install.packages("package name")

within an R console and choose a mirror close to you in the menu. For installing other packages that are not distributed through the CRAN, like TraMineR for the moment, you have to get the package source or binary file
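A script can check whether an add-on package is installed before trying to load it. The sketch below uses requireNamespace(), a base R function; the helper name has_package and the message text are made up for this illustration.

```r
# TRUE if the package can be loaded, FALSE otherwise (no error is raised)
has_package <- function(pkg) {
  requireNamespace(pkg, quietly = TRUE)
}

if (has_package("TraMineR")) {
  library(TraMineR)
} else {
  message("TraMineR is not installed; see Appendix B for installation steps.")
}

has_package("stats")  # the stats package ships with every R installation
```

This avoids the hard error that library() raises for a missing package and lets the script print installation instructions instead.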
19. A-B-B-B-B-D-D
5 A-A-A-A-A-A-A-A
6 C-C-C-C-C-C-C

R> seqlength(ex1.C)
  Length
1     10
2     10
3     10
4      8
5      8
6      7

Now we set all options to NA, and all missing values are considered as part of the sequences.

R> ex1.D <- seqdef(ex1, left = NA, gaps = NA, right = NA)
R> ex1.D
  Sequence
1 *-*-*-A-A-A-A-A-A-A-A-A-A
2 D-D-D-B-B-B-B-B-B-B-*-*-*
3 *-D-D-D-D-D-D-D-D-D-D-*-*
4 A-A-*-*-B-B-B-B-D-D-*-*-*
5 A-*-A-A-A-A-*-A-A-A-*-*-*
6 *-*-*-C-C-C-C-C-C-C-*-*-*

R> seqlength(ex1.D)
  Length
1     13
2     13
3     13
4     13
5     13
6     13

Here is the SPS representation of the previous example.

R> print(ex1.D, format = "SPS")
  Sequence
1 (*,3)-(A,10)
2 (D,3)-(B,7)-(*,3)
3 (*,1)-(D,10)-(*,2)
4 (A,2)-(*,2)-(B,4)-(D,2)-(*,3)
5 (A,1)-(*,1)-(A,4)-(*,1)-(A,3)-(*,3)
6 (*,3)-(C,7)-(*,3)

Chapter 7 Describing and visualizing sequences

This chapter presents the main TraMineR tools for describing and visualizing sequences. We first briefly explain in Section 7.1 the general plotting philosophy adopted in TraMineR. Section 7.2 then presents tools for describing and visualizing properties of the set of sequences from an aggregated standpoint, and Section 7.3 focuses on the characterization of individual sequence properties and their summary.

7.1 General principle of TraMineR sequence plots

TraMineR provides three basic plotting functions for visualizing sequence characteristics, se
20. of subsequences (which includes the reference to the event sequences they were extracted from), a method and, optionally, a list of constraints. The method specifies the information we want. Possibilities are count (default), age (the age at first occurrence of a subsequence) and presence, which returns a matrix with ones indicating the presence of the subsequence and zeros otherwise. In the example below, we count for each sequence how many times it contains each subsequence. The result is a matrix with rows corresponding to the sequences, columns to the specified subsequences, and cell values equal to the requested counts.

R> msubcount <- seqeapplysub(mysubseq, method = "count")
R> msubcount[1:3, ]

[Output garbled in source: a matrix with one row for each of the first three event sequences and one column per selected subsequence. The three sequences go from Parent to Left+Marr to Left+Marr+Child after 9, 1 and 6 time units; from Parent to Left to Left+Marr to Left+Marr+Child after 1, 10, 1 and 4 time units; and from Parent to Left to Left+Marr to Left+Marr+Child after 7, 5, 1 and 3 time units.]

10.5.3 Selecting event subsequences

The seqecontain() function selects a set of event subsequences containing specific events. It checks whether a given subsequence contains given events. For instance, we may want to select frequent subsequences containing one of the
21. Each line contains information about an individual at a different time unit. There is one line for each time unit where the individual is under observation. Such data presentation is mainly used for discrete survival models, where the focus is on a specific event (leaving home, childbirth, death, end of job, etc.) and the time periods considered are those where the cases are at risk of experiencing the event. In that case, each record contains at least the time stamp and a status variable indicating whether the event under study occurred in this time interval, and may be completed with the values of some covariates. (Footnote 1: Original personal identification numbers have been modified.)

4.2.6 The shifted replicated sequence format (SRS)

This data presentation is intended for mobility analysis, where the concern is the transition from the states observed at previous time points t-1, t-2, ... to the one observed at time t. Consider for example the sequence A-A-C-D-D, where the first element in the sequence corresponds to year 2000 and the last one to year 2004. The shifted replicated sequence representation of this sequence is obtained as follows.

R> seqs <- data.frame(y2000 = "A", y2001 = "A", y2002 = "C", y2003 = "D", y2004 = "D")
R> seqs
  y2000 y2001 y2002 y2003 y2004
1     A     A     C     D     D
R> seqformat(seqs, from = "STS", to = "SRS")
  id idx T-4 T-3 T-2 T-1 T
1  1
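The shifted replicated layout can be built by hand for a single sequence, which makes its structure easier to see. The base R sketch below is an illustration only: the function name to_srs is hypothetical, it handles one sequence at a time and omits the id column, and it is not the seqformat() implementation.

```r
# Build the shifted-replicated rows for one state sequence
to_srs <- function(states) {
  n <- length(states)
  lags <- (n - 1):0
  # column for lag k: the sequence shifted down by k positions, padded with NA
  shifted <- sapply(lags, function(k) c(rep(NA, k), states)[seq_len(n)])
  colnames(shifted) <- ifelse(lags == 0, "T", paste0("T-", lags))
  data.frame(idx = seq_len(n), shifted,
             check.names = FALSE, stringsAsFactors = FALSE)
}

srs <- to_srs(c("A", "A", "C", "D", "D"))
print(srs)
```

Row idx = 4, for instance, reads NA, A, A, C in columns T-3 to T and D in column T... more precisely NA in T-4 followed by A, A, C, D, so each row holds the history observed up to that position.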
22. [continuation of the SRS output, columns T-4 to T-1]
1 NA NA NA NA
2 NA NA NA  A
3 NA NA  A  A
4 NA  A  A  C
5  A  A  C  D

In this presentation, we collect in the columns named T-1 and T all subsequences between t-1 and t, and hence all observed transitions between t-1 and t. This is useful when we want t to be a relative time point rather than an absolute date.

4.3 Definition and properties of categorical sequences

The next parts of this manual are dedicated to the analysis of categorical sequences. We define here more precisely what such sequences are, as well as some important concepts such as subsequences.

4.3.1 Categorical sequences

For a formal definition, we may follow for example Elzinga and Liefbroer (2007). First define an alphabet A as the list of possible states or events. A sequence x of length k is then an ordered list of k successively chosen elements of A. It is often represented by the concatenation of the k elements. A sequence can thus be written as $x = x_1 x_2 \cdots x_k$, with $x_j \in A$. We use commas when necessary for distinguishing successive elements in a sequence. For instance, x = S,U,M,MC stands for the sequence single, with unmarried partner, married, married with a child.

4.3.2 Time axis

In addition to the sequencing of states or events that the above definitions account for, the information about sequences, especially those describing life courses, in
23. [tail of the seqdur output]
5 12 NA NA NA NA NA NA NA NA NA NA NA
6  1 11 NA NA NA NA NA NA NA NA NA NA

Note that durations are stored in a matrix with a number of columns equal to the maximum sequence length encountered. This is because in a sequence of length 12, for instance, there can be at most 12 distinct successive states.

8.3 Summarizing the DSS

8.3.1 Number of subsequences

The idea of subsequence is an extension of the notion of substring and is described in detail, for instance, in Elzinga (2008). While a substring of a sequence is necessarily constituted of adjacent symbols, this requirement is relaxed with the notion of subsequence. Thus, if x = abac, then λ (the empty string), u = b, v = bac and w = bc belong to the set of subsequences of x, while only λ, u = b and v = bac are substrings of x. The seqsubsn function returns the number of subsequences contained in a sequence.

R> seqsubsn(head(actcal.seq))

[Output garbled in source: one Subseq value for each of the sequences 1 to 6]

8.3.2 Number of transitions

The number of transitions contained in a sequence is obtained directly from its DSS: it equals the length of the DSS minus one. This is illustrated below with the mvad data set. We first have a look at the first sequences.

R> print(head(mvad.seq), format = "SPS")
  Sequence
1 (TR,2)-(EM,4)-(TR,2)-(EM,64)
2 (JL,2)-(FE,36)-(HE,34)
3 (JL,2)-(TR,24)-(FE,34)-(EM,10)-(JL,2)
4 (TR,49)-(EM,14)-(JL
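The distinct successive states, their durations and the number of transitions of a single sequence can be obtained with rle(), the run length encoding function of base R. The sketch below rebuilds the first mvad sequence from its SPS form shown above; it illustrates the concepts and is not how seqdss() or seqdur() are implemented.

```r
# First mvad sequence, rebuilt from (TR,2)-(EM,4)-(TR,2)-(EM,64)
s <- rep(c("TR", "EM", "TR", "EM"), times = c(2, 4, 2, 64))

runs <- rle(s)
dss <- runs$values        # distinct successive states
dur <- runs$lengths       # time spent in each spell
n_transitions <- length(dss) - 1

dss
dur
n_transitions
```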
25. [Figure 9.2: Sequence frequencies by cluster, biofam data set]

Another help to characterize the patterns within each cluster is to plot the mean times spent in each state.

R> seqmtplot(biofam.seq, group = cluster3)

[Figure 9.3: Mean time in each state by cluster, biofam data set. Panels: Type 1, Type 2, Type 3; legend: Parent, Left, Married, Left+Marr, Child, Left+Child, Left+Marr+Child, Divorced]

Chapter 10 Analysing event sequences

The previous chapters dealt essentially with sequences of states. Here the focus is on sequences of transitions or events. TraMineR offers specific tools for this kind of data that make it possible, among other things, to mine frequent event subsequences (Studer et al., 2008; Agrawal and Srikant, 1995; Zaki, 2001). The TraMineR functions intended for sequences of events start with the seqe prefix, which stands for SEQuence of Events. The concept of event sequence and its formalization were introduced by Agrawal and Srikant (1995), who were mainly interested in frequent buying sequences. We retain here the notation of Zaki (2001), but introduce a new terminology that we think is more appropriate for social sciences. In this chapter
26. content of a text file with such data and some covariates:

06 0.896 20 2 0 4 4   M 44   MC 9   SC 91
07 0.967 20 1 0 4 1   S 66   U 10   M 12   MC 56
08 0.967 20 1 0 4 4   S 72   U 5    M 67
10 0.896 20 2 0 4 1   S 10   U 1    UC 133
27 0.967 20 1 0 4 4   S 54   U 18   S 15   U 11   M 29   MC 17
30 0.896 20 2 0 4 2   S 10   U 14   M 8    MC 112

The first step is to import the text file into an R data frame. We specify that there are no variable names in the first row with the header = FALSE option, that the rows may have unequal length with the fill option, and that empty strings should be treated as missing values with the na.strings option.

R> sweden <- read.table(file = "data/sweden.txt", header = FALSE, sep = " ", na.strings = "", fill = TRUE)

The sequence data is contained in columns 8 to 13. Note that sequences are stored in an unequal number of variables, depending on the number of distinct states the individuals passed through.

R> head(sweden)
  V1    V2 V3 V4 V5 V6 V7   V8   V9  V10    V11  V12   V13
1 06 0.896 20  2  0  4  4 M 44 MC 9 SC 91    NA   NA    NA
2 07 0.967 20  1  0  4  1 S 66 U 10  M 12  MC 56   NA    NA
3 08 0.967 20  1  0  4  4 S 72  U 5  M 67    NA   NA    NA
4 10 0.896 20  2  0  4  1 S 10  U 1 UC 133   NA   NA    NA
5 27 0.967 20  1  0  4  4 S 54 U 18  S 15  U 11 M 29 MC 17
6 30 0.896 20  2  0  4  2 S 10 U 14   M 8 MC 112   NA    NA

Importing this data into a sequence object is now straightforward. We set the informat = "SPS" option since the d
27. definition 33; LCS 93
subsequences of events 106; subset 88; summary 53, 80, 83, 87, 116; support, minimum 104
table() 119; tevent 105; time reference 27; tlim 68, 72; to 44; to.data.frame 35; tr 70; TraMineR installation files 121, 122; transition 104; transition rates 70; TRUE 99; turbulence 84
update.packages 19
var 43, 46, 47
weighted entropy 90; which 80; with.miss 95, 99, 100; withlegend=FALSE 62
28. displayed in a compressed format, i.e. as character strings where the states are separated with the "-" symbol. Internally, however, each state is still stored in a single variable, as shown with the print command with the ext = TRUE option.

R> print(actcal.seq[1:5, 3:8], ext = TRUE)
  mar00 apr00 may00 jun00 jul00 aug00
1     B     B     B     B     B     B
2     D     D     A     A     A     A
3     B     B     B     B     B     B
4     C     C     C     C     C     C
5     A     A     A     A     A     A

We get a more concise view of the sequences with the SPS (state permanence) representation, which yields shorter and more readable sequences. We obtain the SPS representation with the format = "SPS" option.

R> print(actcal.seq[1:5, 3:8], format = "SPS")

Ch. 6 Creating sequence objects

  Sequence
1 (B,6)
2 (D,2)-(A,4)
3 (B,6)
4 (C,6)
5 (A,6)

When subscripts are used to select only parts of a sequence object, the result is still a sequence object, and all attributes of the parent object are preserved (inherited). As an example, the sequences for the summer months only are selected from the previously created actcal.seq sequence object. We see that the color palette (cpal attribute) and the state labels (labels attribute) have been preserved, while the start attribute, originally set to 1 (the default value), has been updated to 6.

R> actcal.summer <- actcal.seq[, 6:9]
R> attr(actcal.summer, "cpal")
[1] "#7FC97F" "#BEAED4" "#FDC086" "#FFFF99"
R> attr(actcal.summer, "labels")
[1] "> 37 hours" "19-36
29. easier to understand, by you and by others, when the name of each argument used is explicitly specified. The seqdef function is used here to illustrate how to specify arguments. This command is one of the first you will issue, since it defines the sequence object required by most of the other functions provided by the TraMineR package. The main arguments of seqdef are:

• data: the name of a data frame;
• var: the names or index numbers of the columns containing the sequence information (the default value is NULL, meaning all the variables in the data set);
• informat: the format of the sequences (the default value is STS, the most common sequence format).

The function seqdef accepts additional arguments (stsep, alphabet, states, start, missing, cnames) that are described later in this manual (see Chapter 5). The name of the data frame is mandatory, but the other arguments have default values and can be omitted if those values suit you. The options can be given in any order if you specify the argument names before their values.

R> data(actcal)
R> actcal.seq <- seqdef(var = 13:24, data = actcal)

In this example, not specifying the argument names var and data generates an error message.

Getting help

To get help about a specific function, seqdef for instance, type

R> ?seqdef

or

R> help(seqtab)

Updating and new features

The update.packages function can be
30. for instance useful for mining frequent event subsequences.

4.1.5 Ontology

An ontology of sequence data formats can be defined by a nested suite of yes/no questions about properties of the format. Figure 4.2 shows an ontology of types of longitudinal data, i.e. data organized according to time.

4.2 Longitudinal data representations

Using some elements of the ontology, Table 4.1 defines several data formats. The basic information used to identify them is whether the elements are states or events, and whether the format uses a single row or more than one row for each case. Table 4.2 gives examples of the listed formats. These, as well as some other formats, are described in detail below in the present section, with an indication of whether they are supported by TraMineR.

4.2.1 The states-sequence (STS) format

The STates-Sequence (STS) format is the internal format used by TraMineR (in TraMineR, sequences are stored in sequence objects, see next section). It is one of the most intuitive and common ways of representing a sequence. In this format, the successive states (statuses) of an individual are given in consecutive columns. Each column is supposed to correspond to a predetermined time unit, but sequences of states with no time reference can be handled as well using the same format. In the actcal data set previously described (see Sec. 3.2.1), sequences are in columns 13 to 24, representi
31. R> seq.ex1 <- seqdef(seq.ex1, informat = "SPS")
R> seq.ex1[, 1:15]
  Sequence
1 000-000-000-000-000-000-000-000-000-000-000-000-0W0-0W0-0W0
2 000-000-000-000-000-000-000-000-000-000-000-000-0W0-0W0-0W0

By default, sequence objects are displayed in STS format when typing their name (the print method is called with default parameters). At first glance, the two sequences do not seem to be very different. However, the difference shows up clearly when displaying them in the SPS format.

R> print(seq.ex1, format = "SPS")
  Sequence
1 (000,12)-(0W0,9)-(0WU,5)-(1WU,2)
2 (000,12)-(0W0,14)-(1WU,2)

6.1.2 Creating a sequence object from SPELL formatted data

Data in the SPELL format can be directly converted into a sequence object with the informat = "SPELL" option. The required data structure and the options for importing spell data are described in more detail in section 5.2.2. The same SPELL data extracted from the Swiss Household Panel retrospective survey will be used here as an example. The original data, containing the living arrangement history (see Table 4.3 on page 32 for the state description), has been imported into R (see Section 5.1.1). The living arrangement histories for the first two individuals (id 2713 and 2714) are displayed below.

R> LA[1:9, ]

  idpers bvla idx bvla013 bvla014 bvla100
1   2713        1    1965    1989 with both natural parents
2   2713        2    1989    1990 with partner, married or not
32. hours" "1-18 hours" "no work"
R> attr(actcal.summer, "start")
[1] 6

The column names are retrieved with the names function.

R> names(actcal.summer)
[1] "jun00" "jul00" "aug00" "sep00"

6.5 Truncations, gaps and missing values

The handling of truncations, gaps and missing values in sequence data has received little attention in the literature. In this section we present the efforts made to address this topic and the features available in TraMineR.

6.5.1 Introduction

To outline how missing values, truncations and gaps in sequences can be handled with TraMineR, we focus on sequences stored in the extended STS format (see Chapter 4). This is one of the most common ways of storing sequences that TraMineR users may encounter, and also the internal storage format for sequence objects in TraMineR. Each sequence is stored in a row of a rectangular matrix, and each row has the same number of elements. However, for several reasons, sequences in a data set may have different lengths or may not begin and end at the same column positions in the matrix. For example:

• Sequences defined as the list of successive states without duration information are typically of varying length.
• In event sequences, the number of events experienced differs from one individual to the other.
• The length of the follow up is not the same for all individuals, or sequences may be right or le
33. it is translated into megabytes by dividing it by 1024.

9.4.4 LCS distance as a special case of OM distance

The LCS distance is equal to the optimal matching (OM) distance computed with an indel cost of 1 and a constant substitution cost of 2. Let us verify this with the ex3.seq sequences used previously.

R> ex3.seq
  Sequence
1 A-B-C-D
2 A-B-B-D
3 A-B-C-D-D
4 A-B-C-D

Optimal matching distances were produced with a constant substitution cost of 2 (the default value) and an indel cost of 1, and stored in the ex3.OM matrix.

R> ex3.OM
     [,1] [,2] [,3] [,4]
[1,]    0    2    1    0
[2,]    2    0    3    2
[3,]    1    3    0    1
[4,]    0    2    1    0

Now we produce the LCS distance matrix.

R> ex3.LCS <- seqdist(ex3.seq, method = "LCS")
R> ex3.LCS
     [,1] [,2] [,3] [,4]
[1,]    0    2    1    0
[2,]    2    0    3    2
[3,]    1    3    0    1
[4,]    0    2    1    0

We can see that these LCS distances are the same as the OM ones. However, since comparing two matrices by eye is error-prone, we check the equality more rigorously with the all.equal function.

R> all.equal(ex3.OM, ex3.LCS)
[1] TRUE

9.4.5 Optimal matching with internal gaps

If missing values (internal gaps) are present in the sequences (see 6.5), one can nevertheless compute distances by setting the with.miss option to TRUE. In that case, the substitution cost matrix must contain one additional entry for the missing state. Let us us
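The equivalence stated in 9.4.4 can also be checked without TraMineR. The following is a minimal base R sketch, not TraMineR's implementation: the helper names lcs_length, lcs_dist and om_dist are hypothetical, the LCS distance is assumed to be the number of states left outside the longest common subsequence, and the four sequences are entered as they read in the listing above.

```r
# Length of the longest common subsequence of two state vectors
# (standard dynamic programming table).
lcs_length <- function(x, y) {
  n <- length(x); m <- length(y)
  tab <- matrix(0L, n + 1, m + 1)
  for (i in seq_len(n)) {
    for (j in seq_len(m)) {
      tab[i + 1, j + 1] <- if (x[i] == y[j]) {
        tab[i, j] + 1L
      } else {
        max(tab[i, j + 1], tab[i + 1, j])
      }
    }
  }
  tab[n + 1, m + 1]
}

# LCS distance (assumed definition): states of x and y that are not
# part of the longest common subsequence.
lcs_dist <- function(x, y) length(x) + length(y) - 2 * lcs_length(x, y)

# Edit distance with an indel cost and a constant substitution cost.
om_dist <- function(x, y, indel = 1, sub = 2) {
  n <- length(x); m <- length(y)
  d <- matrix(0, n + 1, m + 1)
  d[, 1] <- (0:n) * indel
  d[1, ] <- (0:m) * indel
  for (i in seq_len(n)) {
    for (j in seq_len(m)) {
      cost <- if (x[i] == y[j]) 0 else sub
      d[i + 1, j + 1] <- min(d[i, j] + cost,
                             d[i, j + 1] + indel,
                             d[i + 1, j] + indel)
    }
  }
  d[n + 1, m + 1]
}

# The four example sequences, typed in by hand.
seqs <- list(c("A", "B", "C", "D"),
             c("A", "B", "B", "D"),
             c("A", "B", "C", "D", "D"),
             c("A", "B", "C", "D"))

# Apply a distance function to every pair of sequences.
pairwise <- function(f) sapply(seqs, function(a) sapply(seqs, function(b) f(a, b)))
lcs_mat <- pairwise(lcs_dist)
om_mat  <- pairwise(om_dist)

print(lcs_mat)
print(all.equal(lcs_mat, om_mat))
```

With a substitution cost of 2, a substitution never beats one deletion plus one insertion, which is why the two matrices coincide in this sketch.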
34. mvad dataset
A short example: State distribution within each cluster, mvad data
A short example: Sequence frequencies within each cluster, mvad data
A short example: Frequencies of most frequent transitions, mvad data
A short example: Most discriminating transitions between clusters, mvad data
First 10 sequences of the actcal data (first at bottom)
Ontology of types of longitudinal data
Legend plotted as an additional graphic
Distribution of the statuses by age in the mvad data set (data from McVicar and Anyadike-Danes, 2002)
Distribution of the work statuses by month in the actcal data set (data from the Swiss Household Panel)
Entropy of state distribution by age, biofam data set
Entropy of state distribution by age, actcal data set
Plot of the 10 most frequent sequences in the actcal data set
Plot of the 10 most frequent sequences in the biofam data set, with proportional bar widths
Mean time spent in each state, actcal data
Plot of the 10 first sequences of the actcal data set
Plot of all sequences of the mvad data set, grouped according to the gcse5eq variable
35. object from a state sequence object. To illustrate, assume we are interested in analysing frequent transitions occurring in the family life (biofam) data set. We first create the state sequence object

R> data(biofam)
R> bfstates <- c("Parent", "Left", "Married", "Left+Marr", "Child", "Left+Child", "Left+Marr+Child", "Divorced")
R> biofam.seq <- seqdef(biofam, 10:25, states = bfstates, labels = bfstates)

and convert it into an event sequence object.

R> bf.seqe <- seqecreate(biofam.seq)

By default, seqecreate creates a distinct "from-state > to-state" event for each transition found. This behavior can be modified through the tevent argument. When it is set to "state", a single "to-state" event (the start event of the spell in the given state) is assigned to each transition (see below).

R> bf.seqestate <- seqecreate(biofam.seq, tevent = "state")

With the tevent = "period" option, a pair of events (end-state event, start-state event) is assigned to each transition.

R> bf.seqeperiod <- seqecreate(biofam.seq, tevent = "period")

You can also provide a custom transition matrix specifying the set of events that define each transition, that is, the set of events that are supposed to occur when a transition is observed. It may be useful to take one of the transition matrices automatically produced by the seqetm function as a starting point for designing a custom matrix. To illustrate how resulting event sequences look out
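The idea of deriving one event per transition can be sketched in base R for a single sequence. This is an illustration only, not what seqecreate does internally: the helper state_to_events is hypothetical, the time index attached to each event is an assumption of this sketch, and the initial state is not turned into an event here.

```r
# Turn one state sequence into "from>to" transition events.
# state_to_events() is a hypothetical helper, not a TraMineR function.
state_to_events <- function(states) {
  # positions t such that the state at t + 1 differs from the state at t
  change <- which(states[-1] != states[-length(states)])
  data.frame(
    time  = change,  # assumed time stamp: last position spent in the old state
    event = paste(states[change], states[change + 1], sep = ">"),
    stringsAsFactors = FALSE
  )
}

# Toy sequence using three of the biofam state labels.
s  <- c("Parent", "Parent", "Left", "Left", "Left+Marr", "Left+Marr")
ev <- state_to_events(s)
print(ev)
```

Replacing the paste call by states[change + 1] alone would give the analogue of the "to-state" events described for tevent = "state".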
36. object separately for the values of the gcse5eq variable, using the group option. The border = NA option specifies that the borders of the bars are not plotted, and the space = 0 option that the bars representing individual sequences are plotted without space between them, which yields a cleaner graphic when a large number of sequences are plotted. The group option can be useful to distinguish patterns depending on a covariate value. Here, the sequences in the right plot correspond to young people who gained higher qualifications by the end of compulsory education, a large proportion of whom will continue school up to higher education. We observe that the color corresponding to higher education is much more frequent in the right plot, while the colors corresponding to training and employment are much more frequent in the left plot.

This plot of individual sequences complements the averaged representation provided by the state distribution plot by rendering the diversity of the sequences. However, such index plots for thousands of sequences result in very heavy graphic files if they are stored in PDF or PostScript format. To reduce the size, we suggest that in such cases you save the figures in PNG format by using png() instead of postscript() or pdf(). Figure 7.10 was produced as a PNG plot with the following commands.

R> png(file = "Graphiques/mvad-seqiplot-all.png", width = 1600, height = 1200, pointsize = 50)

3 binary dummy indicating qualifications gaine
37. read.delim, read.fwf. See http://cran.r-project.org/doc/manuals/R-data.pdf for details. An example of how to read a comma separated (CSV) text file is given below with the mvad data set described in Section 3.2.4, p. 24. The file can be freely downloaded from http://www.blackwellpublishing.com/rss/Volumes/Av165p2.htm. Though the data set is provided with TraMineR as an R data frame, we show below how it was converted and prepared. The steps are the following:

1. Convert the downloaded .xls file into a .csv (Comma Separated Values) file using, for example, Excel or OpenOffice.
2. Run R and type

R> mvad <- read.csv(file = "data/McVicar.csv", header = TRUE)

where you should adapt the path (data/) to the location of the csv file.

The text file contains only variables with numeric values, but most of them are in fact binary indicator variables (see Table 3.5). Let us take an example with the male indicator variable. For the moment, this variable is stored as numeric, and summarizing it yields quantiles of its distribution.

R> summary(mvad$male)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
 0.0000  0.0000  1.0000  0.5197  1.0000  1.0000

Hence, we convert all indicator variables into factors.

R> yn <- c("no", "yes")
R> mvad$male <- factor(mvad$male, labels = yn)
R> mvad$catholic <- factor(mvad$catholic, labels = yn)
R> mvad$Belfast <- factor(mvad$Belfast, labels = yn)
R> mvad$N.Eastern <- factor(mvad$N.Eastern, labels = yn)
R> m
38. representation consists in listing the events experienced by each individual together with the time at which the events occurred. Sequences of events can easily be constructed from this representation. It is also possible in TraMineR to translate sequence data into such a time stamped event (TSE) representation, at the cost, however, of providing event definition information (see Section 5.2.2, page 41). Each record of the TSE representation usually contains a case identifier, a time stamp and codes identifying the event occurring. In the following example, 3 events (coded 5, 7 and 9) are observed at age (time) 25 for individual 70102. Individual 215102 experiences one event (1) at age 6, two events (5, 17) at age 21, two events (7, 18) at age 22 and two events (8, 13) at age 25.

R> TSE.ex1
       id time event
1   70102   25     5
2   70102   25     7
3   70102   25     9
4  215102    6     1
5  215102   21     5
6  215102   21    17
7  215102   22     7
8  215102   22    18
9  215102   25     8
10 215102   25    13

Ch. 4 Definition and representation of longitudinal data formats

Table 4.3: Living arrangements (SHP)
State  Description
       with both natural parents
       with one parent and his/her new partner
       with one parent alone
       with relatives or in a foster family
       with partner, married or not
       with friends or in a flat share
       alone
       other situation
       with both natural parents and the partner married married
       with both natural parents and
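How sequences of events can be built from the TSE records above may be sketched in base R as follows. The data frame tse retypes the ten example records by hand, and the "(5,7,9)" notation for simultaneous events joined by "-" is an assumption of this sketch rather than TraMineR's exact display.

```r
# The ten time stamped event records of the example, typed in by hand.
tse <- data.frame(
  id    = c(70102, 70102, 70102, 215102, 215102, 215102, 215102, 215102, 215102, 215102),
  time  = c(25, 25, 25, 6, 21, 21, 22, 22, 25, 25),
  event = c(5, 7, 9, 1, 5, 17, 7, 18, 8, 13)
)

# One data frame per individual.
by_case <- split(tse, tse$id)

# Within each individual, group events sharing a time stamp,
# then join the groups in chronological order.
event_seq <- lapply(by_case, function(d) {
  groups <- tapply(d$event, d$time,
                   function(e) paste0("(", paste(sort(e), collapse = ","), ")"))
  paste(groups[order(as.numeric(names(groups)))], collapse = "-")
})

print(event_seq)
```

The grouping step matters because events observed at the same time stamp have no order among themselves, whereas groups at different time stamps do.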
39. s1. Settings for handling missing values in this part of the sequence are defined with the right option.

When defining a sequence object, the user can specify how to handle the elements indexed by each of these three vectors. The options for each part are set with the arguments left, gaps and right. Each of them accepts the following values:

• "DEL", for deleting the NA's, meaning that they do not belong to the sequence. Missing values thus become void. When necessary for maintaining the row length, a special character (% by default) will be inserted on the right of the sequence for each such deleted missing value.
• NA: nothing is done, and each missing value is left as an explicit missing element. For the output, missing values are coded with a special character (* by default) that is more convenient than NA for displaying the sequences.

Default values are left = NA, gaps = NA and right = "DEL".

We demonstrate how the different options work on our example data. First we leave the default settings unchanged. With those settings, all missing values encountered after the last⁴ valid state in a sequence are considered as void elements, i.e. the sequence is considered as ending after the last valid state.

R> ex1.A <- seqdef(ex1)
R> ex1.A

4 "Last" means the rightmost.
5 The code used in the input data for missing values can be set with the missing option. The default is NA, the usual way of representing missing
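The three parts that the left, gaps and right options refer to can be located in a single sequence with a few lines of base R. This is a sketch under stated assumptions: classify_missing is a hypothetical helper, not part of TraMineR, and the treatment of a sequence made only of NA's is an arbitrary choice made here.

```r
# Locate left, internal (gap) and right missing values in one sequence.
# classify_missing() is a hypothetical helper, not a TraMineR function.
classify_missing <- function(s) {
  pos   <- seq_along(s)
  valid <- which(!is.na(s))
  if (length(valid) == 0) {
    # arbitrary choice of this sketch: an all-missing sequence is all "left"
    return(list(left = pos, gaps = integer(0), right = integer(0)))
  }
  first <- min(valid)
  last  <- max(valid)
  list(
    left  = pos[pos < first],                          # before the first valid state
    gaps  = pos[pos > first & pos < last & is.na(s)],  # between valid states
    right = pos[pos > last]                            # after the last valid state
  )
}

s     <- c(NA, NA, "A", "A", NA, "B", "B", NA, NA, NA)
parts <- classify_missing(s)
print(parts)

# Analogue of right = "DEL": drop the trailing missing values only.
trimmed <- s[setdiff(seq_along(s), parts$right)]
print(trimmed)
```

In this toy sequence, positions 1 and 2 are left missing values, position 5 is a gap, and positions 8 to 10 are right missing values.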
40. s4 containing only missing values.

R> s1 <- c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3)
R> s2 <- rep(NA, 12)
R> ms <- rbind(s1, s1, s2, s2)
R> ms.seq <- seqdef(ms, left = NA, gaps = NA, right = NA)
R> ms.seq
  Sequence
1 1-1-1-1-2-2-2-2-3-3-3-3
2 1-1-1-1-2-2-2-2-3-3-3-3
3 *-*-*-*-*-*-*-*-*-*-*-*
4 *-*-*-*-*-*-*-*-*-*-*-*

and compute the OM distances.

R> subm <- seqsubm(ms.seq, method = "CONSTANT", with.miss = TRUE)
R> seqdist(ms.seq, method = "OM", sm = subm, with.miss = TRUE)
     [,1] [,2] [,3] [,4]
[1,]    0    0   24   24
[2,]    0    0   24   24
[3,]   24   24    0    0
[4,]   24   24    0    0

We can see that the distance between s3 and s4 is 0, hence they are considered as identical, as are s1 and s2.

9.5 Clustering distance matrices

A distance matrix does not say much by itself, and once it has been computed, a clustering method is usually applied to aggregate the sequences into a reduced number of groups. In the next example, the agnes function provided by the cluster library is called to create clusters from the previously computed optimal matching distance matrix (see 9.4.3). The chosen clustering method is the Ward method.

R> library(cluster)
R> clusterward <- agnes(biofam.om, diss = TRUE, method = "ward")

Next we plot the dendrogram (Fig. 9.1).

R> plot(clusterward, which.plots = 2)

Dendrogram of agnes x biofam o
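The clustering step can be tried on the small 4 x 4 distance matrix shown above using only base R. This sketch swaps in stats::hclust with method "ward.D2" for agnes with method "ward"; the R documentation of hclust describes the two as corresponding, but the substitution is a choice made here, and the object names d, tree and groups are illustrative.

```r
# The 4 x 4 OM distance matrix from the listing, typed in by hand.
d <- matrix(c( 0,  0, 24, 24,
               0,  0, 24, 24,
              24, 24,  0,  0,
              24, 24,  0,  0),
            nrow = 4, byrow = TRUE,
            dimnames = list(paste0("s", 1:4), paste0("s", 1:4)))

# Ward clustering on the dissimilarities (base R stand-in for agnes).
tree <- hclust(as.dist(d), method = "ward.D2")

# Cut the tree into two groups.
groups <- cutree(tree, k = 2)
print(groups)
```

Calling plot(tree) would draw the corresponding dendrogram.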
41. subsequences of equal length. This is best shown using the SPS format.

R> max.seq <- which(biofam$Turbulence == max.turb)
R> print(biofam.seq[max.seq, ], format = "SPS")
    Sequence
[1] (0,4)-(1,4)-(3,4)-(6,4)

Nonetheless, the correlation between the entropy and turbulence measures is reasonably high, whether we consider the Pearson correlation³ or the Spearman rank correlation.

R> cor(biofam$Turbulence, biofam$Entropy)
[1] 0.8078864

3 The pearson method is the default for the cor function, hence it is not necessary to specify it as an option.

Ch. 8 Sequence characteristics and associated measures

R> cor(biofam$Turbulence, biofam$Entropy, method = "spearman")
[1] 0.731871

Figure 8.7 is obtained with the following command and shows the relationship between the two measures.

R> plot(biofam$Turbulence, biofam$Entropy, main = "Turbulence vs Entropy", xlab = "Turbulence", ylab = "Entropy")

[Figure 8.7: Correlation between within sequence turbulence and entropy, biofam data set. Scatter plot titled "Turbulence vs Entropy"; x axis: Turbulence]

As previously done with the entropy, we would like to have a look at some sequences having low, medium and high turbulence. This is achieved by first storing the values for the 10, 45, 55 and 90 percentiles
42. to a time unit, as in the example below. The actcal data set accompanying the TraMineR package is in the extended format. Each column (variable) contains one state and represents a month of the activity calendar.

R> head(actcal[, 13:24])
     jan00 feb00 mar00 apr00 may00 jun00 jul00 aug00 sep00 oct00 nov00 dec00
2848     B     B     B     B     B     B     B     B     B     B     B     B
1230     D     D     D     D     A     A     A     A     A     A     A     D
2468     B     B     B     B     B     B     B     B     B     B     B     B
654      C     C     C     C     C     C     C     C     C     B     B     B
6946     A     A     A     A     A     A     A     A     A     A     A     A
1872     D     B     B     B     B     B     B     B     B     B     B     B

3 This is done by means of the seqfcheck function, which searches for the presence of any separator in the data.

The compressed format

In the compressed format, a sequence is represented as a character string. A single (string) variable is used for storing the sequence. States or events are represented by words or numerical codes separated by a specific separator character. Sequences can also be handled as character strings without a separator; in that case, however, states or events should be represented by single characters or digits. Below, the six sequences from actcal shown above are displayed in the compressed format. They were compressed with the seqconc function, which is explained below.

  Sequence
1 B-B-B-B-B-B-B-B-B-B-B-B
2 D-D-D-D-A-A-A-A-A-A-A-D
3 B-B-B-B-B-B-B-B-B-B-B-B
4 C-C-C-C-C-C-C-C-C-B-B-B
5 A-A-A-A-A-A-A-A-A-A-A-A
6 D-B-B-B-B-B-B-B-B-B-B-B
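The relationship between the extended and the compressed format can be illustrated with base R string functions. This is a sketch of the idea only, not the code of seqconc or seqdecomp; the object names are illustrative and the two rows retype the first two actcal sequences shown above.

```r
# Two sequences in extended format: one state per column.
extended <- rbind(rep("B", 12),
                  c("D", "D", "D", "D", "A", "A", "A", "A", "A", "A", "A", "D"))

# Compress: one character string per sequence, states joined by "-".
compressed <- apply(extended, 1, paste, collapse = "-")
print(compressed)

# Decompress: split each string on the separator, one row per sequence.
decompressed <- do.call(rbind, strsplit(compressed, "-", fixed = TRUE))
print(dim(decompressed))
```

The round trip only works if the separator never occurs inside a state code, which is why the separator must be chosen with the alphabet in mind.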
43. to generate this matrix is named "transition". In this case, we simply generate a distinct event for each possible transition. The diagonal is set to the different possible states.

R> data(actcal)
R> actcal.seq <- seqdef(actcal, 13:24, labels = c("FullTime", "PartTime", "LowPartTime", "NoWork"))
R> transition <- seqetm(actcal.seq, method = "transition")
R> transition
  A                    B                    C
A "FullTime"           "FullTime>PartTime"  "FullTime>LowPartTime"
B "PartTime>FullTime"  "PartTime"           "PartTime>LowPartTime"
C "LowPartTime>FullTime" "LowPartTime>PartTime" "LowPartTime"
D "NoWork>FullTime"    "NoWork>PartTime"    "NoWork>LowPartTime"
  D
A "FullTime>NoWork"
B "PartTime>NoWork"
C "LowPartTime>NoWork"
D "NoWork"

The second generic method, called "period", generates a begin event and an end event for each spell. The diagonal gives the sequence initiating event, which we represent by the first state of the state sequence. By setting bp = "begin", each initiating event (diagonal element) will be different from the begin event that starts other spells in the same state. Here we do not use this option, and the same event is used for, say, starting a full time job at position 1 and at position 4.

R> transition <- seqetm(actcal.seq, method = "period")
R> transition

Ch. 5 Importing and handling longitudinal data with TraMineR

[matrix listing not recoverable from the source]

However, most of the time we are in
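The structure of the "transition" matrix shown above (a distinct "from>to" label in each off-diagonal cell, the state itself on the diagonal) can be reproduced in base R. This sketch does not call seqetm; the object names states and tm are illustrative.

```r
# State labels, in the order of the alphabet A, B, C, D.
states <- c("FullTime", "PartTime", "LowPartTime", "NoWork")

# Off-diagonal cells: "origin>destination" for every ordered pair of states.
tm <- outer(states, states, paste, sep = ">")

# Diagonal cells: the state itself.
diag(tm) <- states

dimnames(tm) <- list(LETTERS[1:4], LETTERS[1:4])
print(tm)
```

A custom event matrix can then be obtained by overwriting individual cells, for example tm["A", "B"] <- "Decrease".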
44. ure 8.6. Users who do not like the cyan color used in the graphic can choose any other color from the list returned by the colors function.

R> hist(biofam$Turbulence, main = "Sequence turbulences - biofam data set", col = "cyan", xlab = "Turbulence")

The distribution of the turbulences resembles that of the entropy (see Figure 8.2 on page 83). With the following commands we look for the most turbulent sequence.

R> max.turb <- max(biofam$Turbulence)
R> biofam.tmax <- subset(biofam, Turbulence == max.turb)
R> head(biofam.tmax)

8.5 Composite measures of sequences complexity

[Figure 8.6: Histogram of the sequence turbulences, biofam data set]

     idhous   sex birthyr    nat_1_02 plingu02                        p02r01
1098  61871 woman    1953 Switzerland   german Protestant or Reformed Church
                 p02r04 cspfaj              cspmoj a15 a16 a17 a18 a19 a20 a21
1098 a few times a year  other self employed   <NA>   0   0   0   0   1   1   1
     a22 a23 a24 a25 a26 a27 a28 a29 a30   Entropy    ageg Turbulence
1098   1   3   3   3   3   6   6   6   6 0.6666667 1949-58   8.807355

Note the use of the subset function in the previous command instead of the equivalent command

R> biofam[biofam$Turbulence == max.turb, ]

The sequence with maximum turbulence is not the same as that with maximum entropy (cf. Section 8.4.3). It contains four
45. values in R.

[sequence listing of ex1.A not recoverable from the source]

By printing the sequence object in its internal matrix representation, we see that all the trailing positions are occupied by the default TraMineR character code for void elements.

R> print(ex1.A, ext = TRUE)

[matrix listing with columns [1] to [13] not recoverable from the source]

Void elements are, for instance, not taken into account when computing the length of the sequences.

R> seqlength(ex1.A)
  Length
1     13
2     10
3     11
4     10
5     10
6     10

Nonetheless, the computed length is not 10 for all sequences. The left part of sequences s1 and s3, which do not begin in the first column of the matrix, has been considered as part of them. To remedy this problem we could use the left = "DEL" option. With this option, all missing values at the beginning of a sequence are considered as void elements, and the sequence is shifted to the left so that it begins with its first valid status.

R> ex1.B <- seqdef(ex1, left = "DEL")
R> ex1.B

[sequence listing garbled and truncated in the source]
46. 0.7028

Let us have a look at the sequences near the minimum, median and maximum entropy. For that, we draw sets of sequences having an entropy lower than or equal to the 10th percentile, an entropy near the median (between the 45th and 55th percentiles), and an entropy greater than the 90th percentile. We first store the percentiles in the ient.quant vector for later usage.

R> ient.quant <- quantile(biofam$Entropy, c(0, 0.1, 0.45, 0.55, 0.9, 1))
R> ient.quant
       0%       10%       45%       55%       90%      100%
0.0000000 0.1124300 0.3295665 0.3738803 0.5565789 0.7028195

Now we can create a categorical variable from the quantiles with the cut function. The include.lowest option is used to include the values equal to the lowest cutting point, i.e. an entropy equal to 0.

R> ient.group <- cut(biofam$Entropy, ient.quant, labels = c("Min", "q10-45", "Median", "q55-90", "Max"), include.lowest = T)
R> table(ient.group)
ient.group
   Min q10-45 Median q55-90    Max
   223    741    176    664    196

The ient.group factor has 5 levels corresponding to the five intervals defined by the quantiles, but we are mostly interested in only three of them. A way to select them is to redefine the factor with three levels only; all other values of the factor are converted to NA.

R> ient.group <- factor(ient.group, levels = c("Min", "Median", "Max"))
R> table(ient.group)
ient.group
   Min Median    Max
   223    176    196

Finally we plot the thr
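The quantile, cut and factor steps can be tried on any numeric vector. The sketch below uses simulated values as a stand-in for biofam$Entropy, so the counts it produces are not those of the manual; the names x, q, g and g3 are illustrative.

```r
set.seed(1)
x <- runif(200)  # simulated stand-in for an entropy variable

# Cutting points at the 0, 10, 45, 55, 90 and 100 percent quantiles.
q <- quantile(x, c(0, 0.1, 0.45, 0.55, 0.9, 1))

# Five intervals; include.lowest keeps the minimum in the first interval.
g <- cut(x, q, labels = c("Min", "q10-45", "Median", "q55-90", "Max"),
         include.lowest = TRUE)
print(table(g))

# Keep three levels only: values in the dropped levels become NA.
g3 <- factor(g, levels = c("Min", "Median", "Max"))
print(table(g3, useNA = "ifany"))
```

Note that cut stops with an error when two cutting points coincide, which can happen with real data if more than 10% of the values are tied at the minimum.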
47. 13 to 24.

R> data(actcal)
R> actcal.seq <- seqdef(actcal, var = 13:24)

The var argument, which specifies the variables that define the sequences, can be a single variable name or column index number, a set of variable names, or a set of column index numbers. In the next example, the seqdef command is used with the variable names as var argument. The names function returns the names of the variables in the data frame and can be used to locate the corresponding column numbers. In the actcal data set, the sequences are in the variables jan00 to dec00, corresponding to columns 13 to 24.

R> names(actcal)
 [1] "idhous00" "age00"    "educat00" "civsta00" "nbadul00" "nbkid00"
 [7] "aoldki00" "ayouki00" "region00" "com2.00"  "sex"      "birthy"
[13] "jan00"    "feb00"    "mar00"    "apr00"    "may00"    "jun00"
[19] "jul00"    "aug00"    "sep00"    "oct00"    "nov00"    "dec00"

Notice that the column names are grouped into a vector with the c function.

R> actcal.seq <- seqdef(actcal, var = c("jan00", "feb00", "mar00", "apr00", "may00", "jun00", "jul00", "aug00", "sep00", "oct00", "nov00", "dec00"))

1 The class of this object is stslist.

Using variable names instead of column index numbers is safer, because if you delete a variable from the data frame, the index numbers can change while the names remain unchanged. One of the attributes stored in the sequence object is the alphabet, i.e. the list of dis
48.
   a15  a16  a17  a18  a19  a20  a21  a22  a23  a24  a25  a26  a27  a28  a29
0 0.99 0.95 0.92 0.88 0.82 0.71 0.60 0.50 0.42 0.34 0.25 0.20 0.16 0.13 0.10
1 0.01 0.05 0.08 0.11 0.15 0.23 0.28 0.30 0.29 0.28 0.26 0.23 0.20 0.17 0.16
2 0.00 0.00 0.00 0.00 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.08 0.09 0.10
3 0.00 0.00 0.00 0.00 0.01 0.02 0.05 0.08 0.11 0.15 0.19 0.20 0.22 0.20 0.20
4 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
5 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.01 0.01 0.01 0.01 0.01 0.01 0.01
6 0.00 0.00 0.00 0.00 0.01 0.03 0.05 0.08 0.12 0.15 0.21 0.26 0.31 0.36 0.40
7 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.01 0.01 0.02 0.02 0.03 0.03
   a30
0 0.08
1 0.14
2 0.10
3 0.19
4 0.00
5 0.01

7.2 Describing and visualizing sequence data sets

[Figure 7.3: Distribution of the work statuses by month in the actcal data set (data from the Swiss Household Panel). x axis: jan00 to dec00; y axis: Freq. (n = 2000). Legend: > 37 hours, 19-36 hours, 1-18 hours, no work]

Valid states:
 a15  a16  a17  a18  a19  a20  a21  a22  a23  a24  a25  a26  a27  a28  a29  a30
2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000

Entropy:
 a15  a16  a17  a18  a19  a20  a21  a22  a23  a24  a25  a26  a27  a28  a29  a30
0.04 0.10 0.13 0.20 0.29 0.41 0.52 0.61 0.68 0.74 0.78 0.79 0.80 0.79 0.77 0.75

In addition to the state distribution at each time point, the seqstatd function provides also for e
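The kind of table printed above, i.e. the proportion of each state at each time point together with a cross-sectional entropy, can be sketched in base R for a toy data set. This is not the code of seqstatd; dividing the entropy by the logarithm of the alphabet size is the normalization assumed here, and the object names are illustrative.

```r
# Four toy sequences (rows) observed at three time points (columns).
sts <- rbind(c("A", "A", "B"),
             c("A", "B", "B"),
             c("A", "B", "C"),
             c("A", "A", "C"))
alphabet <- c("A", "B", "C")

# Proportion of each state at each time point (states in rows).
freq <- apply(sts, 2, function(col) table(factor(col, levels = alphabet)) / length(col))
print(freq)

# Shannon entropy of each column, divided by its maximum log(alphabet size)
# so that it lies between 0 and 1 (assumed normalization).
entropy <- apply(freq, 2, function(p) {
  p <- p[p > 0]
  -sum(p * log(p)) / log(length(alphabet))
})
print(round(entropy, 2))
```

An entropy of 0 at the first time point reflects that all four toy sequences start in the same state.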
49.
3   2713        3    1990    1991 with partner, married or not
4   2713        4    1991    2002 with partner, married or not
5   2714        1    1968    1985 with both natural parents
6   2714        2    1985    1988 alone
7   2714        3    1989    1990 with partner, married or not
8   2714        4    1990    1991 with partner, married or not
9   2714        5    1991    2002 with partner, married or not

The variables needed to create the sequence object are an identification number to group all rows pertaining to the same individual (idpers), a starting and an ending time for the spells (bvla013 and bvla014), and a status variable (bvla100). Note that the statuses in variable bvla100 appear as labels, since this variable was imported as a factor from the SPSS data file.

R> seqstatl(LA$bvla100)
 [1] "alone"
 [2] "no answer"
 [3] "other situation"
 [4] "with both natural parents"
 [5] "with both natural parents and friends or flat share"
 [6] "with both natural parents and the partner married married"
 [7] "with friends or in a flat share"
 [8] "with one parent alone"
 [9] "with one parent and his/her new partner"
[10] "with partner, married or not"
[11] "with partner, married or not and friends or flat share"
[12] "with relatives or in a foster family"

For convenience, however, we want shorter codes for the statuses when displaying the sequences. Hence, we assign numeric codes instead of the labels with the states option. The original labels are preserved and used as legends for the states, which will appear in the graphics
50.
4 (Parent>Married)  0.122 2.957312e-08 37.81393 21
5 (Left>Left+Marr)  0.234 4.036235e-07 32.72402  8
  Freq.<1945 Freq.>1945 Resid.<1945 Resid.>1945
1  0.3215941 0.56568947   -5.604735    6.066469
2  0.3215941 0.56568947   -5.604735    6.066469
3  0.1640408 0.07274701    3.953680   -4.279396
4  0.1640408 0.07274701    3.953680   -4.279396
5  0.1835032 0.29315961   -3.428990    3.711480

Computed on 2000 event sequences

10.4.1 Plotting the results

The results of the previous analysis can then be plotted (Figure 10.2) with the plot() function. In the resulting plot, the color of each bar is defined by the associated Pearson residual of the Chi-square test. For residuals lower than -2 (dark red), the subsequence is significantly less frequent than expected under independence, while for residuals greater than 2 (dark blue), the subsequence is significantly more frequent.

R> plot(discrcohort[1:5])

10.5 More advanced topics and utilities

10.5.1 Looking after specific subsequences

We may want to search only for specific subsequences. For instance, we may be interested in individuals who experienced the following subsequences:

• Parent → Left → LeftMarr: people leaving home, staying alone and getting married after that.
• Parent → LeftMarr: people leaving home and getting married after that. This is a subsequence of the previous one.

The seqefsub function can determine the frequency of specific subsequences. In order to do that th
51.
Within sequence entropies (actcal data set)
Within sequence entropies (biofam data set)
Low, median and high sequence entropies (biofam data set)
Boxplot of the within sequence entropies by birth cohort (biofam data set)
Boxplot of the within sequence entropies by sex (biofam data set)
Histogram of the sequence turbulences (biofam data set)
Correlation between within sequence turbulence and entropy (biofam data set)
Low, median and high sequence turbulences (biofam data set)
Hierarchical sequence clustering from the OM distances (Ward method)
Sequence frequencies by cluster (biofam data set)
Mean time in each state by cluster (biofam data set)
Frequencies of 15 most frequent event subsequences
10.2 Five most discriminating event subsequences between those born before and after Wa...
A.1 R starting welcome message and command prompt
Chapter 1 Introduction
TraMineR is an R package for mining and visualizing sequences of categorical data. Its primary aim is knowledge discovery from event or state sequences describing life courses, although most of its features also apply to non-temporal data such as text or DNA sequences. The name TraMineR is a contr
52. 4 27 Divorced
102 1 18 20 Single
102 2 21 27 Married w Children
1230 D-D-D-D-A-A-A-A-A-A-A-D
2468 B-B-B-B-B-B-B-B-B-B-B-B
654 C-C-C-C-C-C-C-C-C-B-B-B
6946 A-A-A-A-A-A-A-A-A-A-A-A
1872 D-B-B-B-B-B-B-B-B-B-B-B
4.2.2 The state permanence sequence (SPS) format
The SPS format, whose name (State Permanence Sequence) is due to Aassve et al. (2007), is for instance used by Elzinga (2008). In this format, each successive distinct state in the sequence is given together with its duration. In one variant, each state-duration couple is enclosed in parentheses. The example below is taken from Aassve et al. (2007).
R> print(seq.ex1, format = "SPS")
Sequence
[1] (000,12)-(0W0,9)-(0WU,5)-(1WU,2)
[2] (000,12)-(0W0,14)-(1WU,2)
This format is an alternative way of representing STS sequences. Here are the same sequences as they are internally stored in a sequence object by TraMineR.
R> print(seq.ex1, ext = TRUE)
  1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
1 000 000 000 000 000 000 000 000 000 000 000 000 0W0 0W0 0W0 0W0 0W0
2 000 000 000 000 000 000 000 000 000 000 000 000 0W0 0W0 0W0 0W0 0W0
  18 19 20 21 22 23 24 25 26 27 28
1 0W0 0W0 0W0 0W0 0WU 0WU 0WU 0WU 0WU 1WU 1WU
2 0W0 0W0 0W0 0W0 0W0 0W0 0W0 0W0 0W0 1WU 1WU
4.2.3 The vertical time stamped event (TSE) format
A time stamped event
53. 46 47 48 49 50
The content of the seq vector is printed on two lines, and the [26] appearing in front of the second line indicates that the first element of this second line is the 26th element of the vector (here the value of the 26th element is 26).
A.3.3 Data frames, matrices and lists
In R, several object types are available apart from vectors. The object types we will have to deal with most of the time are data frames, matrices and lists. We briefly describe those objects and give some hints for manipulating them.
Data frames
Since we have not yet introduced sequential data, we consider for illustration purposes the classical iris data set that is distributed with R. We first load it into memory with the data command and display its content by typing its name.
> data(iris)
> iris
    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1   5.1 3.5 1.4 0.2 setosa
2   4.9 3.0 1.4 0.2 setosa
3   4.7 3.2 1.3 0.2 setosa
4   4.6 3.1 1.5 0.2 setosa
5   5.0 3.6 1.4 0.2 setosa
6   5.4 3.9 1.7 0.4 setosa
7   4.6 3.4 1.4 0.3 setosa
8   5.0 3.4 1.5 0.2 setosa
9   4.4 2.9 1.4 0.2 setosa
10  4.9 3.1 1.5 0.1 setosa
141 6.7 3.1 5.6 2.4 virginica
142 6.9 3.1 5.1 2.3 virginica
143 5.8 2.7 5.1 1.9 virginica
144 6.8 3.2 5.9 2.3 virginica
145 6.7 3.3 5.7 2.5 virginica
146 6.7 3.0 5.2 2.3 virginica
147 6.3 2.5 5.0 1.9 virginica
148 6.5 3.0 5.2 2.0 virginica
149 6.2 3.4 5.4 2.3 virginica
150 5.9 3.0 5.1 1.8 virginica
This dat
54.
R> data(famform)
R> famform.seq <- seqdef(famform)
R> famform.seq
Sequence
[1] S-U
[2] S-U-M
[3] S-U-M-MC
[4] S-U-M-MC-SC
[5] U-M-MC
R> seqlength(famform.seq)
Length
[1] 2
[2] 3
[3] 4
[4] 5
[5] 3
Distinct states and durations
A sequence can be considered as an ordered list of the distinct states that an individual has passed through, together with their associated durations. This is the way the state permanence (SPS) format represents sequences, as shown here for the first rows of the actcal.seq object. (Note: head(actcal.seq) is equivalent to actcal.seq[1:6, ].)
R> print(head(actcal.seq), format = "SPS")
Sequence
[1] (B,12)
[2] (D,4)-(A,7)-(D,1)
[3] (B,12)
[4] (C,9)-(B,3)
[5] (A,12)
[6] (D,1)-(B,11)
The seqdss() and seqdur() functions are provided to extract distinct states and durations from sequences. Such separated information is required, for example, for computing sequence turbulence, as explained in Section 8.5.1. In the following example, we extract this separated information from the first six sequences of the actcal.seq object. The sequences of distinct states are obtained with
R> seqdss(head(actcal.seq))
Sequence
[1] B
[2] D-A-D
[3] B
[4] C-B
[5] A
[6] D-B
and durations with
R> seqdur(head(actcal.seq))
  DUR1 DUR2 DUR3 DUR4 DUR5 DUR6 DUR7 DUR8 DUR9 DUR10 DUR11 DUR12
1 12 NA NA NA NA NA NA NA NA NA NA NA
2 4 7 1 NA NA NA NA NA NA NA NA NA
3 12 NA NA NA NA NA NA NA NA NA NA NA
4 9 3 NA NA NA NA
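Splitting a sequence into its distinct successive states and their durations amounts to a run-length encoding. The following is a minimal base R sketch of that idea, not TraMineR's own implementation; the helper name dss_dur is hypothetical, and the input is the second actcal sequence, D-D-D-D-A-A-A-A-A-A-A-D.

```r
# Hypothetical helper (not part of TraMineR): split a state vector into its
# distinct successive states (DSS) and the duration of each spell,
# using base R's run-length encoding.
dss_dur <- function(states) {
  runs <- rle(states)
  list(dss = runs$values, dur = runs$lengths)
}

# Second actcal sequence: D-D-D-D-A-A-A-A-A-A-A-D
s2 <- c(rep("D", 4), rep("A", 7), "D")
res <- dss_dur(s2)
res$dss  # "D" "A" "D"
res$dur  # 4 7 1
```

The two components correspond to what seqdss() and seqdur() report for that sequence.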
55. 5101 6 1961 1
The pdata option is used to specify the name of the data frame, and the pvar option is used to specify the names of the columns containing the respondents' id and birth year.
R> LA.sts.process <- seqformat(LA, id = "idpers", begin = "bvla013", end = "bvla014", status = "bvla100_rec", from = "SPELL", to = "STS", process = TRUE, pdata = shp.birthyr, pvar = c("idpers", "birthy"))
In the output data, each sequence now begins at age 1. The first sequence shows the living arrangement history of the first respondent. He was in state 6 (with both natural parents) from his birth to age 23, and then in state 10 (with partner married or not) from ages 24 to 37.
R> LA.sts.process[1, ]
a1 a2 a3 a4 a5 a6 a7 a8 a9 a10 a11 a12 a13 a14 a15 a16 a17 a18 a19 a20 a21
6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6
a22 a23 a24 a25 a26 a27 a28 a29 a30 a31 a32 a33 a34 a35 a36 a37 a38 a39 a40
6 6 10 10 10 10 10 10 10 10 10 10 10 10 10 10 NA NA NA
a41 a42 a43 a44 a45 a46 a47 a48 a49 a50 a51 a52 a53 a54 a55 a56 a57 a58 a59
NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
a60 a61 a62 a63 a64 a65 a66 a67 a68 a69 a70 a71 a72 a73 a74 a75 a76 a77 a78
NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
a79 a80 a81 a82 a83 a84 a85 a86 a87 a88 a89 a90 a91 a92 a93 a94 a95 a96 a97
NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
a98 a9
56. 572 (Parent>Left+Marr) 0.253 506
Computed on 2000 event sequences
Constraint Value
maxGap 2
windowSize 5
The ageMin, ageMax and ageMaxEnd constraints are relative values. In our case, sequences start at 15 years old. Thus, if we want to search among subsequences beginning between ages 15 and 20, we should set ageMin to 0 (i.e., 15 - 15) and ageMax to 5 (i.e., 20 - 15).
R> time.constraint <- seqeconstraint(ageMin = 0, ageMax = 5)
R> fsubseq <- seqefsub(bf.seqe, pMinSupport = 0.01, constraint = time.constraint)
R> fsubseq[1:4]
  Subsequence Support Count
1 (Parent) 0.9860 1972
2 (Parent)-(Parent>Left) 0.4340 868
3 (Parent)-(Left+Marr>Left+Marr+Child) 0.2825 565
4 (Parent)-(Parent>Left+Marr) 0.2530 506
Computed on 2000 event sequences
Constraint Value
ageMin 0
ageMax 5
If, in addition, we are interested only in subsequences that end before 20 years old, we set ageMaxEnd to 5 (i.e., 20 - 15), since this value is also relative to the start of the sequences.
R> time.constraint <- seqeconstraint(ageMin = 0, ageMax = 5, ageMaxEnd = 5)
R> fsubseq <- seqefsub(bf.seqe, pMinSupport = 0.01, constraint = time.constraint)
R> fsubseq[1:4]
  Subsequence Support Count
1 (Parent) 0.9860 1972
2 (Parent)-(Parent>Left) 0.2205 441
3 (Parent>Left) 0.2205 441
4 (Parent)-(Parent>Left+Marr) 0.0250 50
Computed on 2000 event sequences
Constraint Value
ageMin 0
ageMax 5
ageMaxEnd 5
10.4 Identifying discriminant event subs
57. 8 Oct.98 Nov.98 Dec.98 Jan.99
1 employment employment employment employment employment employment employment
  Feb.99 Mar.99 Apr.99 May.99 Jun.99
1 employment employment employment employment employment
3.2.4 Other data sets borrowed from the literature
The famform data set is a small illustrative data set of family forms used by Elzinga (2008). It consists of 5 sequences of length 5, some having missing values (NA). The states are: single (S), with unmarried partner (U), married (M), married with a child (MC), single with a child (SC). The five sequences in the data are
[1] S U NA NA NA
[2] S U M NA NA
[3] S U M MC NA
[4] S U M MC SC
[5] U M MC NA NA
where the first column contains case labels.
3.3 Performance and memory usage
Depending on your system and the size of your data, some functions for sequence data analysis may consume considerable time and memory, especially the computation of distances between sequences. However, as the critical functions are written in C, the speed of the functions in TraMineR compares quite favorably with other packages that deal with sequence analysis. For instance, it is almost as efficient as TDA and outperforms the package of Brzinsky-Fay et al. (2006) for Stata. Nonetheless, the number of distances to compute increases rapidly with the size of the data set. For a data set of 4000 sequences, the number of distances to compute is 4000 × 3999 / 2
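To get a feeling for how fast the number of pairwise distances grows, the count n(n - 1)/2 can be evaluated directly. The sketch below uses base R only; the 8 bytes per stored distance is an assumption about double precision storage, not a statement about how TraMineR allocates memory.

```r
# Number of distinct pairwise distances among n sequences: n * (n - 1) / 2
n <- 4000
n_pairs <- n * (n - 1) / 2
n_pairs  # 7998000

# Rough memory needed if each distance is kept as an 8-byte double
# (assumption); a full square n x n matrix needs about twice as much.
mem_mb <- n_pairs * 8 / 1024^2
round(mem_mb, 1)

# Doubling the number of sequences roughly quadruples the number of pairs
(2 * n) * (2 * n - 1) / 2 / n_pairs
```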
58. 9
[5] (JL,2)-(FE,25)-(HE,45)
[6] (JL,3)-(TR,33)-(EM,36)
Next, we extract the DSS for each sequence and then compute its length.
R> mvad.dss <- seqdss(mvad.seq)
R> head(mvad.dss)
Sequence
[1] TR-EM-TR-EM
[2] JL-FE-HE
[3] JL-TR-FE-EM-JL
[4] TR-EM-JL
[5] JL-FE-HE
[6] JL-TR-EM
R> seqlength(head(mvad.dss))
Length
[1] 4
[2] 3
[3] 5
[4] 3
[5] 3
[6] 3
We can see that subtracting 1 from the DSS length gives the number of transitions in the sequence.
8.4 Summarizing state durations
8.4.1 Variance of the state durations
8.4.2 Cumulated state durations
The seqistatd function returns, for each sequence, the time spent in the different states.
R> seqistatd(actcal.seq[1:6, ])
  A B C D
1 0 12 0 0
2 7 0 0 5
3 0 12 0 0
4 0 3 9 0
5 12 0 0 0
6 0 11 0 1
We may be interested in the mean time spent in each state. These mean times can be computed by means of the apply() function, with which we can apply the mean function to each column of the matrix returned by seqistatd. In the following example, we first store the outcome of the seqistatd function in statd and then compute the mean value by columns (2nd dimension) with the apply function.
R> statd <- seqistatd(actcal.seq)
R> apply(statd, 2, mean)
A B C D
5.0275 1.9745 1.1780 3.8200
TraMineR provides a special function to visualize the mean time values, described in Chapter 7.
8.4.3 Within sequen
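The column-wise use of apply() can be tried without TraMineR on a small matrix. The sketch below types in by hand the time spent in states A to D by the first six actcal sequences, as implied by their SPS representation; the object names statd6 and mean_time are illustrative only.

```r
# Time spent in each state (A, B, C, D) by the first six actcal sequences,
# entered by hand from their SPS representation, e.g. (D,4)-(A,7)-(D,1).
statd6 <- matrix(c( 0, 12, 0, 0,
                    7,  0, 0, 5,
                    0, 12, 0, 0,
                    0,  3, 9, 0,
                   12,  0, 0, 0,
                    0, 11, 0, 1),
                 ncol = 4, byrow = TRUE,
                 dimnames = list(1:6, c("A", "B", "C", "D")))

# Each row sums to the sequence length (12 months)
rowSums(statd6)

# Mean time per state: apply mean() over the columns (2nd dimension)
mean_time <- apply(statd6, 2, mean)
mean_time
```

For plain column means, colMeans(statd6) gives the same result; apply() is the more general tool because any summary function can be passed in place of mean.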
59. 9 Measuring similarities and distances between sequences
This chapter presents the measures of similarity and distance between sequences available in the TraMineR package. The seqdist function is the main tool provided by the TraMineR package to compute distances between sequences. It can compute the distance matrix (i.e., the distances between all pairs of sequences in the data set) or the distances to a reference sequence, for example the most frequent sequence. The following metrics are available with seqdist:
- the Longest Common Prefix (LCP);
- the Longest Common Subsequence (LCS);
- the Optimal Matching distances (OM).
These metrics and the use of seqdist are described in the following sections.
9.1 Number of matching positions
The number of matching positions is a simple similarity measure. We get it for a given pair of sequences with the function seqmpos, as illustrated below with the famform data included in TraMineR.
R> data(famform)
R> famform.seq <- seqdef(famform)
R> famform.seq
Sequence
[1] S-U
[2] S-U-M
[3] S-U-M-MC
[4] S-U-M-MC-SC
[5] U-M-MC
R> seqmpos(famform.seq[1, ], famform.seq[2, ])
[1] 2
R> seqmpos(famform.seq[2, ], famform.seq[4, ])
[1] 3
9.2 Longest Common Prefix (LCP) distances
One of the measures of similarity and distance between sequences proposed by Elzinga (2008) is based on the length of the longest common pref
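The two notions introduced here, matching positions and common prefix, are simple enough to be sketched in base R. The helpers n_match and lcp_length below are hypothetical and are not TraMineR functions; comparing only over the length of the shorter sequence is an assumption about how sequences of unequal length are handled.

```r
# Hypothetical helpers, not TraMineR functions.

# Number of positions at which two sequences carry the same state,
# compared over the length of the shorter sequence (assumption).
n_match <- function(x, y) {
  k <- min(length(x), length(y))
  sum(x[seq_len(k)] == y[seq_len(k)])
}

# Length of the longest common prefix of two sequences
lcp_length <- function(x, y) {
  k <- min(length(x), length(y))
  same <- x[seq_len(k)] == y[seq_len(k)]
  if (all(same)) k else which(!same)[1] - 1
}

# Four of the famform sequences
s1 <- c("S", "U")
s2 <- c("S", "U", "M")
s4 <- c("S", "U", "M", "MC", "SC")
s5 <- c("U", "M", "MC")

n_match(s1, s2)     # 2
n_match(s2, s4)     # 3
lcp_length(s2, s4)  # 3
lcp_length(s4, s5)  # 0: the sequences differ at the first position
```

Sequences 4 and 5 share the states U, M and MC in the same order, yet their common prefix is empty, which shows how strongly a prefix-based measure depends on alignment at the start.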
60. 9 a100
1 NA NA NA
Here are the same sequences converted to the compressed SPS format:
R> LA.sps.process <- seqformat(LA, from = "SPELL", to = "SPS", compressed = TRUE, id = "idpers", begin = "bvla013", end = "bvla014", status = "bvla100_rec", process = TRUE, pdata = shp.birthyr, pvar = c("idpers", "birthy"))
R> head(LA.sps.process)
Sequence 6 23 10 14 6 16 12 4 10 14 6 16 8 5 9 1 8 1 9 14 102 5 C6 19 C103 110 100 14 27 4
Chapter 6 Creating sequence objects
Once your data is imported into R, the next step for working with most of the functions provided by TraMineR is to create an object containing the sequence data. Such objects are created with the seqdef() function. This function stores the sequences in the TraMineR internal object type, together with some of their attributes.
The seqdef function accepts input data stored in several of the formats described in Chapter 4. The ontology and formats presented there should help users identify the original format of the data they want to analyse with TraMineR. Some examples showing how to create a sequence object from sequence data in several input formats are provided below.
6.1 Creating a sequence object
In the example below, we load the actcal data set and create a sequence object named actcal.seq with the sequences contained in columns
62. Data conversion is done with the seqformat, seqconc and seqdecomp functions described in this section. If you just want to analyse your data with the functions provided by TraMineR, you can directly use the seqdef function described in Chapter 6 and specify the input format. The seqdef function will then automatically call seqformat if necessary. If you want to create event sequences from state sequences in order to analyze them with the TraMineR functions dedicated to event sequences, see Chapter 10 for details on how to make such state-to-event conversions.
5.2.1 Converting between compressed and extended formats
The seqconc and seqdecomp functions convert between compressed and extended representations of sequence data. The seqconc function was used above for creating the compressed string vector actcal.comp. We display here its first 6 elements using the head function.
R> actcal.comp <- seqconc(actcal[, 13:24])
R> head(actcal.comp)
Sequence
[1] B-B-B-B-B-B-B-B-B-B-B-B
[2] D-D-D-D-A-A-A-A-A-A-A-D
[3] B-B-B-B-B-B-B-B-B-B-B-B
[4] C-C-C-C-C-C-C-C-C-B-B-B
[5] A-A-A-A-A-A-A-A-A-A-A-A
[6] D-B-B-B-B-B-B-B-B-B-B-B
The seqdecomp function makes the reverse transformation to the original uncompressed format. Notice that we do not need to specify the names or column indexes of the variables containing the sequence in the previously created actcal.comp data set. Indeed, actcal.comp contains a single string variable na
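The relation between the extended and the compressed representations can be illustrated with base R string functions. This is only a sketch of the idea behind seqconc and seqdecomp, not their implementation; the "-" separator is taken from the output shown above.

```r
# A sequence in extended form: one state per position
ext <- c("D", "D", "D", "D", "A", "A", "A", "A", "A", "A", "A", "D")

# Compressed form: all states concatenated into a single string,
# with "-" as separator
comp <- paste(ext, collapse = "-")
comp

# Reverse transformation: split the string on the separator
back <- strsplit(comp, "-", fixed = TRUE)[[1]]
identical(back, ext)
```

The round trip only works when the separator never occurs inside a state label, which is worth checking before compressing sequences with unusual state names.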
63. December 2000, who were at least 30 years old at the time of the survey, for whom we consider sequences of their family life states between ages 15 and 30. The biofam data set is a random sample of size 2000 of the original data set. It describes the family life courses of individuals born between 1909 and 1972. The possible states are numbered from 0 to 7 and were derived from time stamped event sequences using the coding of Table 3.3. The list of variables is briefly described in Table 3.4.
Table 3.3: State definition for the biofam data set
State  Left parental home  Married  Children  Divorce
0      no                  no       no        no
1      yes                 no       no        no
2      no                  yes      yes/no    no
3      yes                 yes      no        no
4      no                  no       yes       no
5      yes                 no       yes       no
6      yes                 yes      yes       no
7      yes/no              yes/no   yes/no    yes
3.2.3 The mvad data set
This data set, used and described by McVicar and Anyadike-Danes (2002), is now included in the TraMineR package with permission of the authors and the Journal of the Royal Statistical Society. The data covers 712 individuals. Each individual is characterized by 14 variables (including a unique identifier, id) and 72 monthly activity state variables from July 1993 to June 1999. The complete list of variables is given in Table 3.5. Here we show the first row of the mvad data frame.
R> data(mvad)
R> mvad[1, ]
Table 3.4: List of Variables in the biofam data set
Variab
64. GGs E O Qr ow A A A A A B B B B B D D D D D B B B D D A A A A C C
To preserve the number of elements of the rows in the matrix, void elements are added at the end (right side) of the sequence.
R> print(ex1.B, ext = TRUE)
1 2 3 4 5 6 7 8 9 10 11 12 13 A A A A A A A A A A ya 4 D D D B B B B B B B DD DD D D b D PD D vA A A B B B B D D 4 de A x A A A A A A A g C 4
Now the lengths of all sequences appear to be 10, except for s6. This is because s6 begins with missing values that are actually part of the sequence but have been deleted. We can see that, in that case, it is not possible to distinguish the void elements from the real missing values at the beginning of a sequence.
R> seqlength(ex1.B)
Length
[1] 10
[2] 10
[3] 10
[4] 10
[5] 10
[6] 7
The same options are available for the gaps in a sequence, as previously defined. In the next example, we ask to also delete the missing values encountered in the center part of the sequence. Sequences have been reduced to their valid statuses only.
R> ex1.C <- seqdef(ex1, left = "DEL", gaps = "DEL")
R> ex1.C
Sequence
[1] A-A-A-A-A-A-A-A-A-A
[2] D-D-D-B-B-B-B-B-B-B
[3] D-D-D-D-D-D-D-D-D-D
[4] A
65.
3.2.4 Other data sets borrowed from the literature
3.3 Performance and memory usage
4 Definition and representation of longitudinal data formats
4.1 Ontology
4.1.1 States and Events
4.1.2 Single or multichannel
4.1.3 Time reference: Internal and external clocks
4.1.4 One or several rows per individual
4.1.5 Ontology
4.2 Longitudinal data representations
4.2.1 The states-sequence (STS) format
4.2.2 The state permanence sequence (SPS) format
4.2.3 The vertical time stamped event (TSE) format
4.2.4 The spell (SPELL) format
4.2.5 The person-period format
4.2.6 The shifted replicated sequence format (SRS)
4.3 Definition and properties of categorical sequences
4.3.1 Categorical sequences
Subsequences
5 Importing and handling longitudinal data with TraMineR
5.1 Importing data sets into R
5.1.1 Reading data from other statistical packages
5.1.2 Reading data from text files
Data storage in R
66.
R> ex1 <- seqdef(ex1)
Now that the sequence object is created, we display its content.
R> ex1
Sequence
[1] A-A-A-B-B-B-C-C-C-D-D-D
[2] A-D-A-B-C-B-C-B-C-D-A-D
[3] A-B-A-B-A-B-C-D-C-D-C-D
We check with the seqistatd function that the three sequences in the ex1 object contain the same number of A, B, C and D states.
R> seqistatd(ex1)
  A B C D
1 3 3 3 3
2 3 3 3 3
3 3 3 3 3
Now we are ready to verify that the entropy is the same for all sequences. As shown by the results of the seqient function, our claim is true. The normalized entropy equals 1 for each sequence, that is, the entropy reaches its theoretical maximum, which is the entropy of a sequence with all states equally frequent. Unlike the entropy, the turbulence measure of Elzinga (2006), which you may also get with TraMineR (see Section 8.5.1), is sensitive to the state ordering.
R> seqient(ex1)
Entropy
[1] 1
[2] 1
[3] 1
The non-normalized entropy is obtained with the norm = FALSE option.
R> seqient(ex1, norm = FALSE)
Entropy
[1] 1.386294
[2] 1.386294
[3] 1.386294
We now plot a histogram of the within-sequence entropy of the sequences in the actcal data set. We thus plot the actcal.ient object using the hist function.
R> hist(actcal.ient, col = "cyan", main = NULL, xlab = "Entropy")
The histogram can be seen in Figure 8.1. We also produce some summary statistics using the summary function and learn that the mean and the maximum normalize
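The claim that the within-sequence entropy ignores the ordering of the states can be checked by hand. The following base R sketch uses a hypothetical helper, seq_entropy, which is not the seqient implementation; it assumes the natural logarithm, which is consistent with the non-normalized value 1.386294 = log(4) reported above.

```r
# Hypothetical helper (not seqient itself): Shannon entropy of the state
# distribution within one sequence, using the natural logarithm.
# With norm = TRUE the value is divided by log(alphabet size), i.e. by the
# maximum entropy attainable with that alphabet.
seq_entropy <- function(states, alphabet, norm = TRUE) {
  p <- as.numeric(table(factor(states, levels = alphabet))) / length(states)
  p <- p[p > 0]                     # 0 * log(0) is taken as 0
  h <- -sum(p * log(p))
  if (norm) h / log(length(alphabet)) else h
}

alphabet <- c("A", "B", "C", "D")
x1 <- c("A", "A", "A", "B", "B", "B", "C", "C", "C", "D", "D", "D")
x2 <- c("A", "D", "A", "B", "C", "B", "C", "B", "C", "D", "A", "D")

seq_entropy(x1, alphabet, norm = FALSE)  # log(4), about 1.386294
seq_entropy(x2, alphabet, norm = FALSE)  # same value: ordering is ignored
seq_entropy(x1, alphabet)                # 1
```

Both sequences contain each state three times, so their state distributions, and therefore their entropies, are identical even though the orderings differ.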
67. Indexing rows and columns
Elements of an R data frame or matrix are accessed by specifying the row and/or column index. One solution is to give the row and column numbers as indexes. The following command accesses the sepal length (first column) of the first iris flower (first row) from the iris data set.
> iris[1, 1]
[1] 5.1
Alternatively, we may use the row and column names. The following example is equivalent to the previous command.
> iris[1, "Sepal.Length"]
[1] 5.1
It is also possible to use previously created row names.
> iris["fleur d'iris n 1", "Sepal.Length"]
[1] 5.1
In R, there are almost as many ways of doing the same thing as there are stars in the universe. An additional possibility is, for instance, to extract the first column with the $ operator and to specify the first element of the resulting vector.
> iris$Sepal.Length[1]
[1] 5.1
For accessing more than one element, we can use the number sequence generating mechanism. For example, we display the first 10 rows of the iris data set by issuing the following command, in which the missing second argument means that all columns should be selected.
> iris[1:10, ]
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5 2 4 3 4 4 4
68. UNIVERSITÉ DE GENÈVE
FACULTÉ DES SCIENCES ÉCONOMIQUES ET SOCIALES
Département d'économétrie
Mining sequence data in R with the TraMineR package
A user's guide for version 1.2
Alexis Gabadinho, Gilbert Ritschard, Matthias Studer and Nicolas S. Müller
Department of Econometrics and Laboratory of Demography, University of Geneva, Switzerland
http://mephisto.unige.ch/traminer
April 14, 2009
This work is part of the Swiss National Science Foundation research project FN-100012-113998 "Mining event histories: Towards new insights on personal Swiss life courses".
Acknowledgments
TraMineR was mainly developed on an Ubuntu Linux system with several open source free tools and programs, including of course R and the LaTeX language used to write this manual. We would like to thank all the contributors to this free software. We would also like to thank Cees Elzinga for providing us with the code of his CHESA software for sequence analysis, which was helpful for programming some of the metrics he introduced to compute distances between sequences. Thanks also to the participants of the Research Seminar in Statistics for the Social Sciences and Demography in Geneva, as well as to the participants of the Workshop on Sequential Data Analysis held in Lund, Sweden, May 8-9, 2008, for their useful remarks and for beta-testing earlier versions of the package. Thanks also to the Swiss Household Panel, who authorized us to use a sample of their
69. [Figure: state distribution plot, Sep.93 to Apr.99; legend: employment, further education, higher education, joblessness, school, training]
Figure 2.1: A short example. Plot of 10 first sequences (top left), plot of 10 most frequent sequences (top right) and state distribution plot (bottom left), mvad data set.
6. Compute the optimal matching distances using substitution costs based on transition rates observed in the data and a 1 indel cost (see Section 9.4). The resulting distance matrix is stored in the dist.om1 object.
R> submat <- seqsubm(mvad.seq, method = "TRATE")
R> dist.om1 <- seqdist(mvad.seq, method = "OM", indel = 1, sm = submat)
[Figure: entropy of the state distribution against time in months (left panel); histogram of the sequence turbulence (right panel)]
Figure 2.2: A short example. Entropy of the state distribution (left) and
70. (1) To download R, go to http://www.r-project.org. Installing TraMineR is as straightforward as typing install.packages("TraMineR") within an R console.
(2) The following command is issued first to set the graphical display: par(mfrow = c(2, 2))
R> seqlegend(mvad.seq, fontsize = 1.3)
4. Plot the entropy of the state distribution at each time point (Fig. 2.2).
R> Entropy <- seqstatd(mvad.seq)$Entropy
R> plot(Entropy, main = "Entropy of the state distribution", col = "blue", xlab = "Time (months)", ylab = "Entropy", type = "l")
5. Compute, summarize and plot the histogram (Fig. 2.2) of the sequence turbulences (see Section 7.3).
R> Turbulence <- seqST(mvad.seq)
R> summary(Turbulence)
R> hist(Turbulence, col = "cyan", main = "Histogram of the sequence turbulence")
[Figure: index plot of the 10 first sequences (left panel); sequence frequency plot (right panel)]
71. a obtained with the command below shows a different pattern (Figure 7.3). The distribution of the work statuses looks very stable over time.
R> seqdplot(actcal.seq)
(2) This data set is created by binding two character strings with the rbind function.
[Figure: state distribution plot, Jul.93 to Jan.99 (n = 712); legend: employment, further education, higher education, joblessness, school, training]
Figure 7.2: Distribution of the statuses by age in the mvad data set (data from McVicar and Anyadike-Danes, 2002).
State distribution table
Besides plotting the distribution of the states at each time point, you may want to get the figures of the distribution. The seqstatd function returns the table of the state distributions, together with the number of valid states and an entropy measure for each time unit. The state distributions are those visualized by the seqdplot() function. The following example shows the family formation state distribution from age 15 to 30 in the biofam data set (see Table 3.3 for the description of the states). The first argument to the seqstatd function is the previously created biofam.seq sequence object.
R> seqstatd(biofam.seq)
Frequencies
a15 a16 a17 a18 a19 a20 a21 a22 a23 a24 a25 a26 a
72. a set contains measurements about 150 iris flowers from 3 species, as we learn by issuing the help(iris) command
> help(iris)
which produces in a separate window:
iris package:datasets R Documentation
Edgar Anderson's Iris Data
Description:
This famous (Fisher's or Anderson's) iris data set gives the measurements in centimeters of the variables sepal length and width and petal length and width, respectively, for 50 flowers from each of 3 species of iris. The species are _Iris setosa_, _versicolor_, and _virginica_.
The summary function returns basic statistics for all the variables in the data set.
> summary(iris)
Sepal.Length   Sepal.Width    Petal.Length   Petal.Width
Min.   :4.300  Min.   :2.000  Min.   :1.000  Min.   :0.100
1st Qu.:5.100  1st Qu.:2.800  1st Qu.:1.600  1st Qu.:0.300
Median :5.800  Median :3.000  Median :4.350  Median :1.300
Mean   :5.843  Mean   :3.057  Mean   :3.758  Mean   :1.199
3rd Qu.:6.400  3rd Qu.:3.300  3rd Qu.:5.100  3rd Qu.:1.800
Max.   :7.900  Max.   :4.400  Max.   :6.900  Max.   :2.500
Species
setosa    :50
versicolor:50
virginica :50
In R data frames, columns (variables) can be of mixed types. In the iris data set, the variables Sepal.Length, Sepal.Width, Petal.Length and Petal.Width are all numerical. The summary() function computes distribution indicators for them. On the other hand, Species is a categorical variable. In R, this variable type is called a factor, and the values a factor may take are called levels. The Species fa
73. a sets provided by the library. This command has to be issued each time you start a new R session, but only once per session. All the examples in the remainder of this manual assume that the TraMineR library is already loaded. You get information about the installed package, such as the version number and the list of functions and data sets provided, by issuing the command
R> library(help = TraMineR)
The above command opens a help window. The content of the obtained help window is shown in Appendix C.
Using the functions
TraMineR functions are just like other R functions. To call them, you just type in the function name and the requested arguments surrounded by parentheses. Most TraMineR functions require at least the name of a sequence object created with the seqdef or the seqecreate functions (see Chapters 5 and 10) and, optionally, the values for some specific arguments. If the arguments are given in the order expected by the function, you can omit the argument names before their values. Arguments with assigned default values can be omitted unless you want to specify a different value. However, always specifying the names of the arguments is safer, since:
- Adding a new optional argument to a function in a new version of TraMineR may change the order of the arguments, in which case your programs would fail when the names of the arguments are not specified.
- Scripts are
74. aaaa sep 1 2 3 4 5 6 1 a aU a aH a a
5.2.2 The seqformat function
The seqformat function takes as main arguments the name of the sequence data, the names or column indexes of the variables containing the sequences, and the input and output formats. We describe below the various formats that seqformat can handle. By default, the output returned by the function is in the so-called STS compressed format, in which the sequences are stored as character strings. Note that seqformat uses the STS format as an internal intermediate format when translating. Hence, some information can be lost depending on the input and output formats.
Converting to and from the SPS format
The seqformat function can convert from and to the state permanence sequence (SPS) format (see Section 4.2.2). In the next example, we translate the sequences contained in the actcal data frame to SPS format and store the result in the actcal.SPS object.
R> actcal.SPS <- seqformat(actcal, 13:24, from = "STS", to = "SPS")
R> head(actcal.SPS)
  1 2 3 4 5 6 7 8 9 10 11 12
1 (B,12) NA NA NA NA NA NA NA NA NA NA NA
2 (D,4) (A,7) (D,1) NA NA NA NA NA NA NA NA NA
3 (B,12) NA NA NA NA NA NA NA NA NA NA NA
4 (C,9) (B,3) NA NA NA NA NA NA NA NA NA NA
5 (A,12) NA NA NA NA NA NA NA NA NA NA NA
6 (D,1) (B,11) NA NA NA NA NA NA NA NA NA NA
Here are the same sequences but in the compressed f
75. ach time point, the number of valid states and the Shannon entropy of the observed state distribution. Letting $p_i$ denote the proportion of cases in state $i$ at the considered time point, the entropy is
$h(p_1, \ldots, p_s) = -\sum_{i=1}^{s} p_i \log(p_i)$
where $s$ is the size of the alphabet. The entropy is 0 when all cases are in the same state and is maximal when the same proportion of cases is in each state. The entropy can be seen as a measure of the diversity of states observed at the considered time point. Billari (2001) and Fussell (2005) considered, for instance, such entropy values for studying early life trajectories, the latter author applying the concept to aggregated virtual trajectories derived from cross-sectional data.
Let us look at our example above. At age 15, 99% of the respondents had not left the parental home (state 0), hence the entropy is very low (0.06). The entropy of the state distribution rises with age and reaches its maximum at age 27. At this age, 14% of the respondents had not left the parental home, 29% had left the parental home but were not married and had no children (state 1), 1% had one or more children without being married, and 28% had one or more children and were married (state 6).
We can also plot the reported entropy measures. For that, we need to access the Entropy element of the list returned by seqstatd. We first store the outcome of the function in an object named sd and e
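The entropy of a state distribution at one time point can be evaluated with a few lines of base R. The helper state_entropy below is hypothetical and uses the natural logarithm, which is an assumption that reproduces the order of magnitude of the value reported for age 15; the way the remaining 1% is spread over the other states is also an assumption.

```r
# Hypothetical helper: Shannon entropy of a vector of state proportions,
# computed with the natural logarithm.
state_entropy <- function(p) {
  stopifnot(isTRUE(all.equal(sum(p), 1)))
  p <- p[p > 0]          # states with zero proportion contribute nothing
  -sum(p * log(p))
}

state_entropy(c(1, 0, 0, 0))   # all cases in the same state: 0
state_entropy(rep(1 / 8, 8))   # uniform over 8 states: log(8), the maximum

# 99% of cases in one state; putting the remaining 1% in a single other
# state is an assumption made for illustration
state_entropy(c(0.99, 0.01))   # roughly 0.056
```

The last value is close to the 0.06 quoted for age 15, where nearly all respondents are still in state 0.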
76. action of Life Trajectory Miner for R. Indeed, as some may expect, it was also inspired by the authors' taste for Gewurztraminer wine.

This manual is essentially a tutorial that describes the features and usage of the TraMineR package. It may also serve, however, as an introduction to sequential data analysis. The presentation is illustrated with data from the social sciences. Illustrative data sets and R scripts (sequences of R commands) are included in the TraMineR distribution package.

For newcomers to R, a short introduction to the R environment is given in Appendix A, in which the reader will learn where R can be obtained as well as its basic commands and principles. Appendix B explains how to install TraMineR in R, while Chapter 3 briefly explains how to use the package and describes the illustrative data sets provided with it.

1.1 Aims and features of the TraMineR package

Some of the features of TraMineR can be found in other statistical programs handling sequential data. For instance, TDA (Rohwer and Pötter 2002), which is freely available at http://www.stat.ruhr-uni-bochum.de/tda.html, the T-COFFEE/SALTT program by Notredame et al. (2006), the dedicated CHESA program by Elzinga (2007), freely downloadable at http://home.fsw.vu.nl/ch.elzinga, and the add-on Stata package by Brzinsky-Fay et al. (2006), freely available for licensed Stata users, all compute the optimal matching edit distance between sequences, and each of them offers specific usef
77. al.seq[, 7:9], tlim = 10)
          Freq Percent
A/3        813   40.65
D/3        581   29.05
B/3        308   15.40
C/3        174    8.70
D/2-C/1     15    0.75
C/2-D/1     11    0.55
A/1-D/2      9    0.45
D/1-C/2      9    0.45
A/2-D/1      8    0.40
D/1-A/2      8    0.40

7.2.4 Transition rates

The seqtrate function computes the transition rates between states or events. The outcome is a matrix where each row gives the transition distribution from the associated originating state (or event) at time t to the states at time t+1; the figures sum to one in each row. In the following example, the transition rate matrix for the actcal activity calendar data set is computed. Transition rates from one state to the same state (diagonal elements) have values close to 1, meaning that a person in a given state at time t has a high probability of remaining in the same state at time t+1. The instability is a bit higher for state C (part-time paid job, from 1 to 18 hours a week), since the probability of staying in that state is 0.94, while the instability of state A is the lowest, with a probability of staying in that state of 0.99.

R> tr <- seqtrate(actcal.seq)
R> round(tr, 2)
       [-> A] [-> B] [-> C] [-> D]
[A ->]   0.99   0.01   0.00   0.01
[B ->]   0.01   0.97   0.01   0.01
[C ->]   0.01   0.01   0.93   0.05
[D ->]   0.01   0.01   0.01   0.97

Notice that the matrix is not symmetrical. The transition rate from state A to state B is 0.005 (0.5%), while the transition rate from B to A is 0.01
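The way such a matrix arises can be sketched in base R without TraMineR: count every observed move from the state at time t to the state at time t+1, then divide each row by its total. The three short sequences below are hypothetical, and this is only an illustration of the idea, not the seqtrate implementation.

```r
# Hypothetical sequences: 3 cases observed at 4 time points.
seqs <- rbind(
  c("A", "A", "A", "B"),
  c("A", "B", "B", "B"),
  c("B", "B", "A", "A")
)
states <- c("A", "B")

# Pair each state at time t with the state at time t + 1.
from <- factor(as.vector(seqs[, -ncol(seqs)]), levels = states)
to   <- factor(as.vector(seqs[, -1]), levels = states)

# Transition counts, then row-wise proportions (each row sums to one).
counts <- table(from, to)
tr <- prop.table(counts, margin = 1)
round(tr, 2)
```

Here the rate from A to B (2 of 5 moves out of A) differs from the rate from B to A (1 of 4 moves out of B), which illustrates why a transition rate matrix is generally not symmetrical.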
78. alizing individual sequences

7.3.1 Visualizing individual sequences

The seqiplot function renders individual sequences with stacked bars depicting the statuses over time, in the same manner as seqfplot. The difference is that seqiplot neither selects nor ranks the sequences according to their frequencies. The interest of such plots, known as index plots, has for instance been stressed by Scherer (2001), Brzinsky-Fay et al. (2006) and Gauthier (2007). In TraMineR, you can select the indexes of the sequences to be plotted with the tlim option, which takes 1:10 as default value, i.e. the first 10 rows of the sequence object. Several other options are available to fine-tune the graphic. You will find their description in the reference manual or in the on-line help of the function, which you get by typing ?seqiplot or help(seqiplot).

In the first example below, the first 10 sequences in the actcal.seq sequence object are plotted (Figure 7.9). The legend uses the labels attached to the actcal.seq object and the color palette is the one set by default. In the next example we plot all sequences in the previously defined mvad.seq sequence

[Figure 7.9: Plot of the 10 first sequences of the actcal data set. Legend: > 37 hours, 19-36 hours, 1-18 hours, no work.]
79. ance and memory usage

Number of seq.   Seq. length   System                                              Time    Matrix size
712              72            Intel Core 2, 2.13GHz, 2Gb RAM                      21s     3.9Mb
4318             16            Intel Core 2, 2.13GHz, 2Gb RAM                      15s     142Mb
30328            77            4x Quad-Core 64-bit Xeon CPUs, 2.4 GHz, 64GB RAM    54 mn   6.85Gb

Chapter 4
Definition and representation of longitudinal data formats

In Section we defined sequences as ordered lists of states or events. However, sequence representation in data files can vary a lot depending on the way data were collected and the way information is organized. In numerous cases, sequences are not even present as such in the data, but can be reconstructed from data originally collected as spells, time-stamped events or other forms. Hence, a crucial preliminary step in sequential data analysis is preparing the data so that it is organized in the form expected by the functions we want to use. This is often a cumbersome, discouraging task, and the literature does not offer much help in identifying the main types of sequential data organization and formats, Giele and Elder (1998) being one of the rare exceptions. Conscious of the importance of the issue, we devoted a lot of effort to these aspects when developing the package. TraMineR thus provides a unique set of features for handling and converting data to and from several different formats. This chapter describes these formats and Chapter 5 details the data management tools available in TraM
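The matrix sizes reported above can be approximated with a back-of-the-envelope calculation. The sketch below assumes that the distances are held in a full n x n matrix of 8-byte double values; the helper name dist_matrix_size is hypothetical and is not part of TraMineR.

```r
# Approximate memory footprint, in bytes, of a full n x n matrix of doubles.
# Assumption: 8 bytes per value and no overhead.
dist_matrix_size <- function(n, bytes_per_value = 8) {
  n^2 * bytes_per_value
}

# Size in megabytes (1 Mb taken as 2^20 bytes) for 712 sequences.
round(dist_matrix_size(712) / 2^20, 1)
```

Under these assumptions the result for 712 sequences agrees with the 3.9Mb reported in the first row of the table. Because the size grows with the square of the number of sequences, doubling the number of sequences quadruples the memory needed.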
80. and C. Wunsch (1970). A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology 48, 443-453.

Notredame, C., P. Bucher, J.-A. Gauthier, and E. D. Widmer (2006). T-COFFEE/SALTT: User guide and reference manual. Technical report, CNRS Marseille and PAVIE, University of Lausanne. Available at http://www.tcoffee.org/saltt.

Paradis, E. (2005). R pour les débutants. F-34095 Montpellier: Institut des Sciences de l'Evolution, Université Montpellier II.

Rohwer, G. and U. Pötter (2002). TDA user's manual. Software, Ruhr-Universität Bochum, Fakultät für Sozialwissenschaften, Bochum.

Scherer, S. (2001). Early career patterns: A comparison of Great Britain and West Germany. European Sociological Review 17(2), 119-144.

Studer, M., A. Gabadinho, N. S. Müller, and G. Ritschard (2008). Approches de type n-grammes pour l'analyse de parcours de vie familiaux. Revue des nouvelles technologies de l'information RNTI E-11(II), 511-522.

Zaki, M. J. (2001). SPADE: An efficient algorithm for mining frequent sequences. Machine Learning 42(1/2), 31-60.
81. and install it manually as described in Chapter 3. Once the package is installed, you will be able to access its functions after issuing the appropriate library command, e.g.

> library(TraMineR)

A.5 Some other useful functions

A.5.1 The apply function

The apply function applies a function to every row or every column of a matrix or data frame. This is a very useful function. In the example below, we create a 3 x 4 table by combining three rows of length 4. We then compute the mean value of each column (the 2nd dimension) and then of each row (the 1st dimension).

> mat <- rbind(c(1, 3, 5, 4), c(2, 3, 1, 5), c(2, 6, 3, 1))
> mat
     [,1] [,2] [,3] [,4]
[1,]    1    3    5    4
[2,]    2    3    1    5
[3,]    2    6    3    1
> apply(mat, 2, mean)
[1] 1.666667 4.000000 3.000000 3.333333
> apply(mat, 1, mean)
[1] 3.25 2.75 3.00

A.5.2 The table function

For factor variables, i.e. categorical variables, the table command gives the count of each of its values. As seen before, the $ operator followed by the column name extracts the corresponding column from a data frame. In the next example, we tabulate the Species variable with the table function.

> table(iris$Species)

    setosa versicolor  virginica
        50         50         50

A.6 Creating and saving graphics

The pdf() and postscript() commands open a PDF or PostScript file that will contain all the graphics produced by plotting commands (e.g. plot()). The dev.off() command must be used to close the file. The
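As a short illustration of the open, plot, close cycle for graphics files, the following sketch writes a single plot to a PDF in a temporary directory. The file name and the plotted data are arbitrary choices made for this example.

```r
# Write one plot to a PDF file in the session's temporary directory.
out <- file.path(tempdir(), "example-plot.pdf")

pdf(out)                                   # open the PDF graphics device
plot(1:10, (1:10)^2, main = "Example")     # anything plotted now goes to the file
dev.off()                                  # close the device so the file is complete

file.exists(out)
```

If dev.off() is forgotten, the file stays open and may be unreadable, so it is worth calling it as soon as the last plot has been drawn.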
82. ary of the LA data frame shows that some variables, such as the begin (bvla013) and end (bvla014) of the spell, were imported as numeric variables (distribution summarized by quantiles), while the type of living arrangement (bvla100) has been imported as a factor (distribution summarized by a frequency table).

R> summary(LA)
     idpers              q_source         bvla_idx          bvla013
 Min.   :    4101   2001 pretest: 2627   Min.   : 0.000   Min.   : (illegible)
 1st Qu.: 3515102   2002        :18484   1st Qu.: 1.000   1st Qu.:1962
 Median : 7344101                        Median : 3.000   Median :1977
 Mean   : 7286883                        Mean   : 2.885   Mean   :1963
 3rd Qu.:10820101                        3rd Qu.: 4.000   3rd Qu.:1989
 Max.   :14676102                        Max.   :13.000   Max.   :2002
    bvla014                               bvla100
 Min.   :   2   with partner, married or not  :7438
 1st Qu.:1974   with both natural parents     :6240
 Median :1989   alone                         :2?38
 Mean   :1974   other situation               :178?
 3rd Qu.:2001   with one parent alone         : 961
 Max.   :2002   with friends or in a flat share: 948
                (Other)                       :1055

SPSS (.sav) format

Here we read the same data file as in the previous example, but from the SPSS version, which is also provided on the SHP CD. The to.data.frame = TRUE option is specified so that the read.spss function returns a data frame; otherwise it would return a list.

R> library(foreign)
R> LA <- read.spss("data/shp0_bvla_user.sav", to.data.frame = TRUE)

5.1.2 Reading data from text files

Several functions are available for reading data in various text formats: read.table, read.csv
83. In the example below, the normalized entropies for the sequences of the actcal.seq object are computed and the results are stored in an object named actcal.ient; we display the first of them. As expected, the entropy for the first sequence is 0, since it belongs to an individual who worked full time during the whole period. The entropy is higher for sequence number 2, which describes an individual who changed activity status several times.

R> actcal.ient <- seqient(actcal.seq)
R> head(actcal.ient)
    Entropy
1 0.0000000
2 0.4899344
3 0.0000000
4 0.4056391
5 0.0000000
6 0.2069084

Note that this entropy measure does not account for the ordering of the states in the sequence. To demonstrate this, we construct a small data set containing three sequences with the same states ordered differently. We first construct a vector for each sequence, combine the obtained vectors into a matrix with the rbind function, and finally convert the matrix into a sequence object.

R> s1 <- c(…)
R> s2 <- c(…)
R> s3 <- c(…)
R> ex1 <- rbind(s1, s2, s3)
R> ex1
   [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12]
s1 …
s2 …
s3 …
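The insensitivity of the entropy to the ordering of the states can be checked with a few lines of base R. The sketch below does not call seqient; the helper seq_entropy and the three sequences are hypothetical stand-ins, and the normalization by the log of the alphabet size is assumed from the description of the measure in this manual.

```r
# Within-sequence entropy from the proportions of each state in the sequence.
# With norm = TRUE the value is divided by log(alphabet size).
seq_entropy <- function(x, alphabet, norm = TRUE) {
  p <- table(factor(x, levels = alphabet)) / length(x)
  p <- p[p > 0]                      # 0 * log(0) is taken as 0
  h <- -sum(p * log(p))
  if (norm) h / log(length(alphabet)) else h
}

alphabet <- c("A", "B", "C", "D")

# Hypothetical sequences: the same states, ordered differently.
s1 <- c("A", "A", "A", "A", "B", "B", "B", "B", "D", "D", "D", "D")
s2 <- c("A", "B", "D", "A", "B", "D", "A", "B", "D", "A", "B", "D")
s3 <- rev(s1)

e1 <- seq_entropy(s1, alphabet)
e2 <- seq_entropy(s2, alphabet)
e3 <- seq_entropy(s3, alphabet)
c(e1, e2, e3)
```

All three values are identical because only the state proportions enter the formula. As a cross-check, applying the same helper to a sequence made of four D, seven A and one D (the second actcal sequence as it is printed elsewhere in this manual) gives the 0.4899344 reported above.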
84. ata is in the SPS format. The additional SPS.in option, which is passed to the seqformat function, is used to specify which characters surround each state-duration couple (here there are no such characters) and which character is used as separator between each state and its associated duration.

6.1 Creating a sequence object

The length of the sequences is 144, but here we display the first 30 statuses only, in the STS representation.

R> sweden.seq <- seqdef(data = sweden, var = 8:13, informat = "SPS", SPS.in = list(xfix = "", sdsep = …))
R> sweden.seq[1:6, 1:30]
  Sequence
1 M-M-M-M-M-M-M-M-M-M-M-M-M-M-M-M-M-M-M-M-M-M-M-M-M-M-M-M-M-M
2 S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S
3 S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S
4 S-S-S-S-S-S-S-S-S-S-U-UC-UC-UC-UC-UC-UC-UC-UC-UC-UC-UC-UC-UC-UC-UC-UC-UC-UC-UC
5 S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S
6 S-S-S-S-S-S-S-S-S-S-U-U-U-U-U-U-U-U-U-U-U-U-U-U-M-M-M-M-M-M

Here is another example with SPS-formatted sequences, taken from Aassve et al. (2007). We first create the sequences as character strings and assemble them with the rbind function.

R> seq1 <- "(000,12)-(0W0,9)-(0WU,5)-(1WU,2)"
R> seq2 <- "(000,12)-(0W0,14)-(1WU,2)"
R> seq.ex1 <- rbind(seq1, seq2)

The seq.ex1 object simply holds 2 character strings. Now we turn it into a sequence object using the seqdef function.

R>
85. ates are defined has been proposed by Rohwer and Pötter (2002, Sec. 3.4.1). The authors distinguish between

• a calendar time axis, which does not have a natural origin. Fixing an origin is simply a convention for providing time points;
• a process time axis, where the origin represents the date of a starting event.

4.1.4 One or several rows per individual

The most natural way of presenting sequence data is to use one row per case. However, using several rows for data belonging to the same individual may also have its advantages. A first example is provided by the multichannel context, in which it may be worthwhile to distinguish explicitly between sequences belonging to different domains or aspects (living arrangement, civil status, education, professional). In longitudinal analysis, it is also sometimes more convenient to use a distinct row

• by time unit lived by each individual. States of the different channels will be in columns; such data presentation is commonly called person-period data;
• by spell lived by each individual. Each row defines the states in which the individual is during the spell; this presentation is called spell data and requires specifying the spell start and end time or, equivalently, start time and duration;
• by episode lived by each individual, i.e. a row for each date at which one or more events occur. In this case, the row contains the time stamp and the list of events that occur; this kind of presentation is
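The difference between these layouts can be sketched in base R for a single individual. The sequence of monthly states, the id value and the column names below are assumptions made for the example; they do not correspond to a TraMineR data set or function.

```r
# One row per individual: a hypothetical sequence of 12 monthly states.
states <- c("D", "D", "D", "D", "A", "A", "A", "A", "A", "A", "A", "D")

# Person-period layout: one row per time unit lived by the individual.
pp <- data.frame(id = 1, time = seq_along(states), state = states,
                 stringsAsFactors = FALSE)

# Spell layout: one row per run of identical consecutive states,
# with the start and end time of each spell.
runs <- rle(states)
end <- cumsum(runs$lengths)
spell <- data.frame(id = 1,
                    begin = end - runs$lengths + 1,
                    end = end,
                    state = runs$values,
                    stringsAsFactors = FALSE)
spell
```

The person-period table has 12 rows, while the spell table has only 3 (D from 1 to 4, A from 5 to 11, D at 12), which shows why spell data needs explicit start and end times to carry the same information.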
86. bing and visualizing sequence data sets
7.2.1 List of states present in sequence data
7.2.2 State distribution
7.2.3 Sequence frequencies
7.2.4 Transition rates
7.2.5 Mean time spent in each state
7.3 Describing and visualizing individual sequences
7.3.1 Visualizing individual sequences
7.3.2 Finding sequences with a given subsequence
8 Sequence characteristics and associated measures
8.1 Basic sequence characteristics
8.1.1 Sequence length
8.2 Distinct states and durations
8.3 Summarizing the DSS
8.3.1 Number of subsequences
8.3.2 Number of transitions
8.4 Summarizing state durations
8.4.1 Variance of the state durations
8.4.2 Cumulated state durations
8.4.3 Within sequence entropy
8.5 Composite measures of sequences complexity
8.5.1 Sequence turbulence
8.5.2 Weighted entropy
9 Measuring similarities and distances between sequences
9.1 Number of matching positi
87. bott and Forrest 1986; Abbott 2001). The algorithm implemented in TraMineR is that of Needleman and Wunsch (1970). The seqdist function with method = "OM" generates the optimal matching distances. In that case, additional required arguments are

• an insertion/deletion (indel) cost;
• a substitution cost matrix giving the cost of substituting each state (event) with another.

9.4.1 The insertion/deletion cost

The indel cost is a single value specified by the user. Its default value is 1.

9.4.2 The substitution cost matrix

The substitution cost matrix is a square matrix of dimension ns x ns, where ns is the number of distinct states in the data (the alphabet). The element (i, j) in the matrix is the cost of substituting state i with state j. Several methods exist to generate the substitution cost matrix:

• Assigning a constant value, in which case all substitution costs are set equal to this constant (method = "CONSTANT" option). The default constant value is 2.
• Using the transition rates between states observed in the sequence data (method = "TRATE" option). The transition rate between state i and state j is the probability of observing state j at time t+1 given that state i has been observed at time t. For i ≠ j, the substitution cost is equal to $2 - p(i \mid j) - p(j \mid i)$, where $p(i \mid j)$ is the transition rate between state i and state j. The transition rates can be obtained with the function seqtrate.

The seqsubm function returns a
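The two ways of building a substitution cost matrix described above can be sketched in base R. The transition rate matrix below is hypothetical, and the transition-rate based cost is assumed to take the form 2 minus the two transition rates between the pair of states; this is an illustration only, not the seqsubm implementation.

```r
states <- c("A", "B", "C")

# Hypothetical transition rate matrix: rows are the state at time t,
# columns the state at time t + 1, and each row sums to one.
tr <- matrix(c(0.90, 0.08, 0.02,
               0.05, 0.90, 0.05,
               0.01, 0.09, 0.90),
             nrow = 3, byrow = TRUE, dimnames = list(states, states))

# Constant costs: every substitution between distinct states costs 2.
sc_const <- matrix(2, 3, 3, dimnames = list(states, states))
diag(sc_const) <- 0

# Transition-rate based costs (assumed form): 2 minus the rate from i to j
# minus the rate from j to i, and 0 on the diagonal.
sc_trate <- 2 - tr - t(tr)
diag(sc_trate) <- 0
sc_trate
```

Although the transition rate matrix itself is not symmetrical, adding the two directions makes the resulting cost matrix symmetrical, and frequent transitions between two states translate into a lower substitution cost.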
88. bout the activity calendar; the meaning of the states A, B, C, D is given in Table 3.1 on page 21. The most frequent sequence (38%) in the data set is full-time paid job during the whole period (January to December 2000) and appears 757 times. The second most frequent sequence (25%) is no paid job during the whole period and appears 508 times. Note that the sequences are displayed in the more readable SPS format.

R> seqtab(actcal.seq, tlim = 10)
           Freq Percent
A/12        757   37.85
D/12        508   25.40
B/12        250   12.50
C/12        115    5.75
C/9-D/3      15    0.75
A/10-B/2     12    0.60
B/10-C/2      8    0.40
B/11-A/1      8    0.40
D/11-C/1      8    0.40
D/9-C/3       8    0.40

[Figure 7.6: Plot of the 10 most frequent sequences in the actcal data set. Legend: > 37 hours, 19-36 hours, 1-18 hours, no work.]

[Figure 7.7: Plot of the 10 most frequent sequences in the biofam data set with proportional bar widths. Legend: Parent, Left, Married, Left+Marr, Child, Left+Child, Left+Marr+Child, Divorced.]

We can ask for the sequence frequency table for months July (7) to September (9) only.

R> seqtab(actc
89. cOS X and Windows, the R environment comes with a command editor that you can use to write, save and execute your programs. Under Linux, you have to resort to a separate editor such as gedit to write and save your programs. You may then copy and paste programs into the R console to run them or, alternatively, use the source command.

R version 2.7.0 (2008-04-22)
Copyright (C) 2008 The R Foundation for Statistical Computing
ISBN 3-900051-07-0

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

  Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

>

Figure A.1: R starting welcome message and command prompt

Objects and functions

Functions in R take one or more arguments.

Getting help

Within R, you can get help about a function with the help(function name) command, including for all the functions provided by the TraMineR package. Try for instance the following:

> help(seqdist)

A.3 Data manipulation in R

A.3.1 Creating and printing objects

The operator
90. ce entropy

In order to measure the diversity of the states in a given sequence, TraMineR offers two functions. The first one measures the entropy of the sequence and the second one, which is discussed later in Section 8.5.1, is the turbulence.

TraMineR provides the function seqient, which returns the Shannon entropy of each sequence in the data. The entropy of a sequence is computed using the formula

$$h(\pi_1, \ldots, \pi_s) = -\sum_{i=1}^{s} \pi_i \log \pi_i$$

where $s$ is the size of the alphabet and $\pi_i$ the proportion of occurrences of the $i$th state in the considered sequence. The log is here the natural (base e) logarithm. The entropy can be interpreted as the uncertainty of predicting the states in a given sequence. If all states in the sequence are the same, the entropy is equal to 0. The maximum entropy for a sequence of length 12 with an alphabet of 4 states is 1.386294 and is attained when each of the four states appears 3 times.

The seqient function returns a vector containing the entropy for each sequence of the provided sequence object. By default, seqient normalizes the entropy by dividing the value of $h(\pi_1, \ldots, \pi_s)$ by the entropy of the alphabet. The latter is indeed an upper bound of the entropy, which corresponds to the maximal possible entropy when the sequence length is a multiple of the alphabet size. The normalized entropy has a maximal value of 1. Unstandardized entropies can be obtained with the norm = FALSE option.
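The maximum quoted above can be verified by hand. As a short worked derivation under the stated setting (length 12, alphabet of 4 states, each state appearing 3 times, natural logarithm), every proportion equals 3/12 = 1/4:

```latex
\begin{aligned}
h\!\left(\tfrac{1}{4}, \tfrac{1}{4}, \tfrac{1}{4}, \tfrac{1}{4}\right)
  &= -\sum_{i=1}^{4} \tfrac{1}{4} \log \tfrac{1}{4}
   = -4 \cdot \tfrac{1}{4} \cdot \log \tfrac{1}{4} \\
  &= \log 4 \approx 1.386294 .
\end{aligned}
```

Dividing by the entropy of the alphabet, which is also log 4 here, gives a normalized entropy of exactly 1 for such a sequence.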
91. cludes often a time dimension. When necessary, we should then also account either for the time stamp of the states or events, or for the duration of either the states or the time between events. For state sequences over time, it is often assumed that each state corresponds to periodic dates (years, months). For event sequences over time, a specific time stamp is most often assigned to each event.

4.3.3 Subsequences

A sequence $u$ is a subsequence of $x$ if all successive elements $u_i$ of $u$ appear in $x$ in the same order, which we simply denote by $u \subset x$. According to this definition, unshared states can appear between those common to both sequences $u$ and $x$. For example, u = (S, M) is a subsequence of x = (S, U, M, MC).

Chapter 5
Importing and handling longitudinal data with TraMineR

Two main preliminary steps are needed for the user to visualize and analyse sequence data with the functions provided by the TraMineR package:

• Import the data into R.
• Create a sequence object, either a state sequence object as described in Chapter 6, or an event sequence object as explained in Section 10.5.4.

In this chapter, we first briefly describe how to import data coming from other statistical packages or text files, and the way imported data is stored in R objects. If your data is already in one of the formats supported by the function that creates sequence objects, you may want to skip the remainder of the chapter and proceed directly to Chapter 6. However i
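The subsequence definition given above, in which the elements must appear in the same order but need not be adjacent, can be expressed as a small base-R check. The function name is_subsequence is hypothetical and is not part of TraMineR.

```r
# TRUE if every element of u appears in x in the same order,
# possibly with other elements in between.
is_subsequence <- function(u, x) {
  pos <- 0                              # position in x of the last matched element
  for (el in u) {
    hits <- which(x == el)
    hits <- hits[hits > pos]            # only matches after the previous one count
    if (length(hits) == 0) return(FALSE)
    pos <- hits[1]                      # take the earliest possible match
  }
  TRUE
}

is_subsequence(c("S", "M"), c("S", "U", "M", "MC"))
is_subsequence(c("M", "S"), c("S", "U", "M", "MC"))
```

The first call returns TRUE, matching the example in the text, and the second returns FALSE because S does not occur after M. Always taking the earliest possible match is safe here, since it leaves the largest remaining part of x for the elements still to be matched.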
92. ctor has three levels.

> levels(iris$Species)
[1] "setosa"     "versicolor" "virginica"

Matrices

Matrices are multidimensional objects like data frames; however, they do not allow mixing data types. For example, if we try to transform the iris data frame into a matrix, all the elements, including numbers, will be converted to character strings, since one column of the data (Species) is not numeric. The function as.matrix is used to convert the iris data frame into a matrix. There are a lot of similar functions in R for converting from one object type to another.

> as.matrix(iris)
      Sepal.Length Sepal.Width Petal.Length Petal.Width Species
 [1,] "5.1"        "3.5"       "1.4"        "0.2"       "setosa"
 [2,] "4.9"        "3.0"       "1.4"        "0.2"       "setosa"
 [3,] "4.7"        "3.2"       "1.3"        "0.2"       "setosa"
 [4,] "4.6"        "3.1"       "1.5"        "0.2"       "setosa"
 [5,] "5.0"        "3.6"       "1.4"        "0.2"       "setosa"
 [6,] "5.4"        "3.9"       "1.7"        "0.4"       "setosa"
 [7,] "4.6"        "3.4"       "1.4"        "0.3"       "setosa"
 [8,] "5.0"        "3.4"       "1.5"        "0.2"       "setosa"
 [9,] "4.4"        "2.9"       "1.4"        "0.2"       "setosa"
[10,] "4.9"        "3.1"       "1.5"        "0.1"       "setosa"

Lists

A list is an object consisting of an ordered collection of objects. It is created with the list command. The list below contains, for instance, three components.

> list.ex <- list(name = "Alice", age = 40, children.at = c(22, 24, 25))
> list.ex
$name
[1] "Alice"

$age
[1] 40

$children.at
[1] 22 24 25

We access a component by issuing, for instance, list.ex$children.a
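The behaviour of matrices and lists described above can be checked directly in base R. The sketch below reuses the iris data set shipped with R and the list.ex object from the text; the object names m_all and m_num are arbitrary.

```r
# A matrix holds a single data type: converting a data frame that contains
# a non-numeric column turns every element into a character string.
m_all <- as.matrix(iris[1:3, ])      # includes the Species column
m_num <- as.matrix(iris[1:3, 1:4])   # numeric columns only
is.character(m_all)
is.numeric(m_num)

# A list can mix components of different types and lengths.
list.ex <- list(name = "Alice", age = 40, children.at = c(22, 24, 25))

list.ex$children.at        # access a component by name with $
list.ex[["age"]]           # equivalent access with double brackets
list.ex$children.at[2]     # then index inside the component
```

Restricting the conversion to the four measurement columns keeps the matrix numeric, which is usually what is wanted before applying functions such as apply with a numeric summary.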
93. d by the end of compulsory education (yes = 5 GCSEs at grades A-C, or equivalent).

R> seqiplot(mvad.seq, group = mvad$gcse5eq, tlim = 0, border = NA, space = 0)
R> dev.off()

[Figure 7.10: Plot of all sequences of the mvad data set grouped according to the gcse5eq variable. Panels: no, yes. Time axis from Jul.93 to Jun.99. Legend entries include further education, joblessness and training.]

This results in a degradation of the graphic's quality, but it reduced the size of this manual in PDF format by about 5MB.

7.3.2 Finding sequences with a given subsequence

The seqpm function counts the number of sequences that contain a given subsequence and collects their row index numbers. The function returns a list with two elements. The first element, MTab, is just a table with the number of occurrences of the given subsequence in the data. Note that only one occurrence is counted per sequence, even when the subsequence appears more than once in the sequence. The second element of the list, MIndex, gives the row index numbers of the sequences containing the subsequence. These index numbers may be useful for accessing the concerned sequences (example below). Since it is easier to search for a pattern in a character string, the function fir
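The idea of searching for a pattern after turning each sequence into a character string can be sketched in base R. This is not the seqpm implementation: the sequences are hypothetical, and the pattern is matched on consecutive states of one-character codes.

```r
# Hypothetical state sequences, one per row.
seqs <- rbind(
  c("A", "A", "B", "D"),
  c("D", "D", "D", "D"),
  c("B", "D", "B", "D")
)

# Collapse each sequence into one character string.
strs <- apply(seqs, 1, paste, collapse = "")

# Which sequences contain the pattern "BD" at least once?
hits <- grepl("BD", strs, fixed = TRUE)

n_match <- sum(hits)    # number of sequences containing the pattern
idx <- which(hits)      # their row index numbers
n_match
idx
```

The third sequence contains the pattern twice but is counted only once, which mirrors the counting rule described above; the index vector can then be used to extract the matching rows, for example seqs[idx, ].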
94. d entropy are respectively 0.07484 and 0.97957.

R> summary(actcal.ient)
    Entropy
 Min.   :0.00000
 1st Qu.:0.00000
 Median :0.00000
 Mean   :0.07484
 3rd Qu.:0.00000
 Max.   :0.97957

Now we would like to know what the maximum value of the within sequence entropy is and look at the sequence(s) reaching this maximum value. The max function returns the maximum of the actcal.ient vector of within sequence entropies.

R> max(actcal.ient)
[1] 0.979574

The which function is used to locate the row index number(s) of the sequences that reach this maximum entropy. Here it is row number 1836.

R> index <- which(actcal.ient == max(actcal.ient))
R> index
[1] 1836

[Figure 8.1: Within sequence entropies, actcal data set. Histogram with Entropy on the horizontal axis, from 0.0 to 1.0.]

R> actcal.seq[index, ]
     Sequence
1836 A-B-B-C-D-D-D-D-C-C-A-A

The same result can be obtained more simply, but also more mysteriously, with a single command. Below we display the rows of the actcal data frame, which contain more information than the sole sequences of the actcal.seq object, and we can see that this is a woman aged 37 having two children aged 14 and 10.

R> actcal[actcal.ient == max(actcal.ient), ]
     idhous00 age00       educat00 civsta00 nbadul00 nbkid00 aoldki00 ayouki00
5587   116151    37 apprenticeship  married        2       2       14       10
re
95. 5.1.4 Compressed and extended format
5.2 Converting between formats
5.2.1 Converting between compressed and extended formats
5.2.2 The seqformat function
6 Creating sequence objects
6.1 Creating a sequence object
6.1.1 Creating a sequence object from SPS formatted data
6.1.2 Creating a sequence object from SPELL formatted data
6.2 Attributes of sequence objects
6.2.1 STA codes
6.2.2 Alphabet
6.2.3 Color palette
6.2.4 State labels
6.2.5 INCAS HME
6.3 Summarizing sequence objects
6.4 Indexing and printing sequence objects
6.5 Truncations, gaps and missing values
6.5.1 NEO AUCHON
6.5.2 Handling the different kinds of missing values
7 Describing and visualizing sequences
7.1 General principle of TraMineR sequence plots
7.1.1 Color palette representing the states
7.1.2 Plotting the legend separately
7.2 Descri
96. [Figure 8.3: Low, median and high sequence entropies, biofam data set. Panels: Min, Median, Max. Time axis from a15 to a30. Legend: Parent, Left, Married, Left+Marr, Child, Left+Child, Left+Marr+Child, Divorced.]

Now we are ready to plot the entropy by ten-year age cohorts. We choose the boxplot command. The result is shown in Figure 8.4. The Entropy ~ ageg part of the command is a formula syntax widely used in R. Here it means: plot the entropy by age group.

R> boxplot(Entropy ~ ageg, data = biofam, xlab = "Birth cohort", ylab = "Sequences entropy", col = "cyan")

The mean and median entropy are rising in the more recent birth cohorts. Figure 8.4, obtained with the following commands, shows that the entropy is also slightly higher in the family formation histories of women than in those of men.

R> boxplot(Entropy ~ sex, data = biofam, xlab = "Sex", ylab = "Sequences entropy", col = "cyan")

8.5 Composite measures of sequences complexity

8.5.1 Sequence turbulence

Sequence turbulence is a measure proposed by Elzinga (Elzinga and Liefbroer 2007; Elzinga 2006). It is based on the number of distinct subsequenc
97. e subsequences must be encoded in the text format used for displaying the subsequences (see above). The previous subsequences would thus be encoded as follows:

[Figure 10.2: Five most discriminating event subsequences between those born before and after 1945. Panels: < 1945, > 1945. Color legend: Pearson residuals (-4, -2, neutral, 2, 4).]

R> mysubseqstr <- character(2)
R> mysubseqstr[1] <- "(Parent)-(Left)-(Left+Marr)"
R> mysubseqstr[2] <- "(Parent)-(Left+Marr)"

and here is how we get the frequency of these subsequences:

R> mysubseq <- seqefsub(bf.seqestate, strsubseq = mysubseqstr)
R> print(mysubseq)
                  Subsequence Support
         (Parent)-(Left+Marr)  0.4870
  (Parent)-(Left)-(Left+Marr)  0.2275

Computed on 2000 event sequences

The result can be used as an argument of functions such as seqecmpgroup.

10.5.2 Counting the number of occurrences in each event sequence

We now use the preceding outcome to compute, with the seqeapplysub function, the number of occurrences of each subsequence. The seqeapplysub function takes three arguments: a list
98. e the following example data set

R> ex1
   [1]  [2]  [3]  [4]  [5]  [6]  [7]  [8]  [9]  [10] [11] [12] [13]
s1 NA   NA   NA   A    A    A    A    A    A    A    A    A    A
s2 D    D    D    B    B    B    B    B    B    B    NA   NA   NA
s3 NA   D    D    D    D    D    D    D    D    D    D    NA   NA
s4 A    A    <NA> <NA> B    B    B    B    D    D    NA   NA   NA
s5 A    NA   A    A    A    A    <NA> A    A    A    <NA> <NA> <NA>
s6 <NA> <NA> <NA> C    C    C    C    C    C    C    NA   NA   NA

9.4 Optimal matching (OM) distances

We define a sequence object with default options for handling missing values, that is, missing values appearing after the last valid state of a sequence are considered as void elements and other missing values as unknown states (see Section 6.5).

R> ex2.seq
  Sequence
1 A-B-C-D
2 A-B-C-D
3 A-B-C-D-A

Now we compute a substitution cost matrix containing an entry for unknown states; the default substitution cost for unknown states is 2, the same as the default substitution cost for the other states

R> subm <- seqsubm(ex2.seq, method = "CONSTANT", with.miss = TRUE)
R> subm
[substitution cost matrix output]

and compute the OM distances with the with.miss = TRUE option.

R> seqdist(ex2.seq, method = "OM", sm = subm, with.miss = TRUE)
     [,1] [,2] [,3]
[1,]    0    1    2
[2,]    1    0    1
[3,]    2    1    0

One should be careful when computing distances between sequences containing unknown states. In the next example, we define a sequence object with two sequences s3 and
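To show how an optimal matching distance is built from indel and substitution costs, here is a minimal base-R dynamic programming sketch in the spirit of the Needleman and Wunsch approach. It is an illustration under simplifying assumptions (a single constant substitution cost, no special handling of missing values); the function name om_dist is hypothetical and this is not the compiled routine used by seqdist.

```r
# Edit distance between two state sequences with an indel cost and
# a constant substitution cost (assumed defaults: indel = 1, sub = 2).
om_dist <- function(x, y, indel = 1, sub = 2) {
  n <- length(x)
  m <- length(y)
  d <- matrix(0, n + 1, m + 1)
  d[, 1] <- (0:n) * indel       # delete all of x's first i states
  d[1, ] <- (0:m) * indel       # insert all of y's first j states
  for (i in seq_len(n)) {
    for (j in seq_len(m)) {
      cost <- if (x[i] == y[j]) 0 else sub
      d[i + 1, j + 1] <- min(d[i, j] + cost,        # match or substitute
                             d[i, j + 1] + indel,   # delete x[i]
                             d[i + 1, j] + indel)   # insert y[j]
    }
  }
  d[n + 1, m + 1]
}

om_dist(c("A", "B", "C", "D"), c("A", "B", "C", "D", "A"))   # one insertion
om_dist(c("A", "B", "C", "D"), c("A", "X", "C", "D"))        # one substitution
```

With these costs a substitution (2) is never cheaper than a deletion followed by an insertion (1 + 1), so the choice of indel and substitution costs directly determines which kind of edit the distance favours.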
99. e time-consuming functions, especially those computing distances between sequences, are written in C for better performance. A C compiler is therefore required for installing TraMineR from source. The installation procedure remains, however, straightforward.

Note: TraMineR uses some functions provided by other optional R packages (e.g. the RColorBrewer package for creating the color palettes in graphics); those packages must be installed on your system in order to compile.

B.2.1 Windows

Under Windows, you just need to have the Windows Rtools tool set installed. This tool set includes a C compiler and other tools such as Perl. Thus, for installing from source, just proceed as follows:

1. Install the Rtools tool set, which can be downloaded from the web page http://www.murdoch-sutherland.com/Rtools/installer.html
2. Download the package TraMineR_1.tar.gz and remember the name of the directory where the file is saved.
3. Open a DOS terminal (DOS command prompt) and type in

cd C:\Program files\R\R-2.7.0\bin
R CMD INSTALL path\TraMineR_1.tar.gz

The cd DOS command changes the current directory to the folder where the R.exe binary is installed, and the second line installs the package. You should adapt the path to your download folder.

B.2.2 Linux

1. The GCC compiler and header files must be installed on your system.
2. Download the package TraMineR_1.tar.gz a
100. ect. However, if we had selected the two objects as subsets of the actcal.seq sequence object, they would have inherited the same alphabet.

R> actcal.seq <- seqdef(actcal, 13:24)
R> actcal.s1 <- actcal.seq[1:3, ]
R> actcal.s2 <- actcal.seq[7:9, ]
R> alphabet(actcal.s1)
[1] "A" "B" "C" "D"
R> alphabet(actcal.s2)
[1] "A" "B" "C" "D"

6.2.3 Color palette

The color palette attached to a sequence object is used by default in the graphical functions provided by TraMineR. If no optional argument is provided, a color palette is created with the dedicated RColorBrewer R package, which is loaded at start-up by TraMineR. The default color palette is Accent. It can be overridden by the user with the cpal option. The expected argument is a character vector containing a color for each state in the alphabet. The colors() function provides the list of color names available in R.

R> actcal.seq <- seqdef(actcal, 13:24, cpal = c("red", "blue", "green", "yellow"))

The color palette for an existing sequence object may be modified by providing a vector with color names

R> attr(actcal.seq, "cpal") <- c("pink", "purple", "cyan", "yellow")

or by retrieving the colors from a color palette. In the example below, we retrieve 4 colors from the Dark2 color palette provided by the RColorBrewer package.

R> attr(actcal.seq, "cpal") <- brewer.pal(4, "Dar
101. ecutive states. In the example below, the sequence x comes from the actcal data set and contains 12 elements corresponding to the successive work statuses from January to December 2000. The same sequence formatted in the distinct successive state (DSS) format exhibits only 3 elements, as shown by the output of the seqdss() function.

R> data(actcal)
R> actcal.seq <- seqdef(actcal, 13:24)
R> actcal.seq[2, ]
  Sequence
2 D-D-D-D-A-A-A-A-A-A-A-D
R> seqdss(actcal.seq[2, ])
  Sequence
1 D-A-D

We can compute the number of distinct subsequences with the seqsubsn() function. With the DSS = FALSE option, the returned result is 76. With the default DSS = TRUE option, the computation is made on the sequence of distinct successive states only (D-A-D), returning 7 as the number of distinct subsequences.

R> seqsubsn(actcal.seq[2, ], DSS = FALSE)
  Subseq.
2      76
R> seqsubsn(actcal.seq[2, ], DSS = TRUE)
  Subseq.
1       7

The seqST() function returns Elzinga's turbulence measure for each sequence of the provided sequence object. We begin with a small example taken from Aassve et al. (2007). The original sequences are defined in SPS format by couples of two character strings. Hence, we give the informat = "SPS" option to the seqdef function for creating the sequence object (see Section 7.2.1 for the syntax used to create the sp.ex1 data set).

R> sp.ex1
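The two counts reported above (76 and 7) can be checked without TraMineR. The following is a minimal base R sketch, not the package's implementation: the helper names `dss` and `count_distinct_subseq` are hypothetical, the sequence is typed in by hand, and the count is assumed to include the empty subsequence, which is the convention that reproduces the figures quoted in the text.

```r
# Sequence 2 of the actcal example, typed in by hand.
x <- c("D", "D", "D", "D", "A", "A", "A", "A", "A", "A", "A", "D")

# Distinct successive states: keep one state per run of identical states.
dss <- function(s) rle(s)$values

# Number of distinct subsequences, the empty subsequence included
# (assumption: this is the convention behind the counts in the text).
# count[i + 1] holds the count for the first i elements of s.
count_distinct_subseq <- function(s) {
  n <- length(s)
  count <- numeric(n + 1)
  count[1] <- 1                 # empty prefix: only the empty subsequence
  last_pos <- list()            # last position at which each state was seen
  for (i in seq_len(n)) {
    state <- s[i]
    count[i + 1] <- 2 * count[i]
    prev <- last_pos[[state]]
    if (!is.null(prev)) {
      # remove the subsequences already produced by the previous occurrence
      count[i + 1] <- count[i + 1] - count[prev]
    }
    last_pos[[state]] <- i
  }
  count[n + 1]
}

dss(x)                          # "D" "A" "D"
count_distinct_subseq(x)        # 76
count_distinct_subseq(dss(x))   # 7
```

The recurrence doubles the count at each position and subtracts the subsequences that were already generated the last time the same state appeared, which is why long runs of a repeated state add few new subsequences.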
102. ee sets of sequences separately using the group option. Recall that the seqiplot function plots by default only the first 10 sequences, but this is enough. The result is shown in Figure 8.3. It confirms that the more different states there are in a sequence, the higher its entropy.

R> seqfplot(biofam.seq, group = ient.group, pbarw = TRUE)

We may want to plot the distribution of the entropy by birth cohort. It does not make sense to use the individual birth years, as there are too many different values. We therefore first group the birth years into ten-year classes. To do this, we look at the distribution of the birth years using the summary function. Then, by means of the cut function, we add the new ageg cohort variable to the biofam data set. The cut function takes three arguments: the name of the variable from which to create classes of values, the bins for creating the classes and, optionally, labels for the classes. The include.lowest = TRUE option tells the function that the lowest value (1909) should be included in the first group.

R> summary(biofam$birthyr)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   1909    1935    1944    1943    1951    1957
R> biofam <- data.frame(biofam, ageg = cut(biofam$birthy, c(1909, 1918, 1928, 1938, 1948, 1958), label = c("1909-18", "1919-28", "1929-38", "1939-48", "1949-58"), include.lowest = TRUE))
R> table(biofam$ageg)
1909-18 1919-28 1929-38 1939-48 1949-58
     35     194     449     620     702

8.5 Composit
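The effect of include.lowest is easy to miss, so here is a small base R sketch of the same cut() call. The birth years below are hypothetical values chosen to sit on the class boundaries; they are not taken from the biofam data set.

```r
# Hypothetical birth years, for illustration only (not taken from biofam).
birthyr <- c(1909, 1918, 1919, 1935, 1944, 1948, 1951, 1957)

breaks <- c(1909, 1918, 1928, 1938, 1948, 1958)
labels <- c("1909-18", "1919-28", "1929-38", "1939-48", "1949-58")

# Intervals are open on the left and closed on the right, e.g. (1918, 1928].
# include.lowest = TRUE additionally closes the first interval on the left,
# so that 1909 falls into "1909-18".
ageg <- cut(birthyr, breaks = breaks, labels = labels, include.lowest = TRUE)
table(ageg)

# Without include.lowest, the smallest value falls outside every class.
without_lowest <- cut(birthyr, breaks = breaks, labels = labels)
is.na(without_lowest[1])
```

Because the classes are closed on the right, a value equal to an inner break (1918 or 1948 here) belongs to the class that ends at that break.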
103. ences in your data are in another format, you should specify it with the format option. In the following example, we retrieve the alphabet for the two sequences of the sp.ex1 data set.

R> sp.ex1 <- rbind("(000,12)-(0W0,9)-(0WU,5)-(1WU,2)", "(000,12)-(0W0,14)-(1WU,2)")
R> sp.ex1
     [,1]
[1,] "(000,12)-(0W0,9)-(0WU,5)-(1WU,2)"
[2,] "(000,12)-(0W0,14)-(1WU,2)"
R> seqstatl(sp.ex1, format = "SPS")
[1] "000" "0W0" "0WU" "1WU"

7.2.2 State distribution

State distribution plot

The seqdplot function plots a graphic showing the state distribution at each time point (the columns of the sequence object). The state distribution itself is obtained with the seqstatd command described below. In the next example, we plot the state distribution for the mvad data set. We first define a mvad.labels vector of state labels to be used for the legend of the colors used in the plot. This vector has six elements since there are six different states in the alphabet.

R> data(mvad)
R> mvad.labels <- c("employment", "further education", "higher education", "joblessness", "school", "training")
R> mvad.seq <- seqdef(mvad, 15:86, labels = mvad.labels)

The plot is produced with the following command

R> seqdplot(mvad.seq)

The resulting graphic is shown in Figure 7.2. The proportion of individuals who are employed increases over time, and employment becomes the most frequent state at the end of the follow-up. The state distribution plot for the actcal dat
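To make concrete what "the state distribution at each time point" means, the sketch below computes it by hand in base R for a tiny made-up data set. The matrix `seqs`, the alphabet and the name `state_dist` are illustrative assumptions; this is not how seqstatd is implemented, only the quantity that a state distribution plot displays.

```r
# Hypothetical toy data: 4 sequences (rows) observed at 3 time points (columns).
seqs <- rbind(
  c("A", "A", "B"),
  c("A", "B", "B"),
  c("C", "B", "B"),
  c("A", "A", "C")
)
alphabet <- c("A", "B", "C")

# For each time point, the proportion of sequences found in each state.
state_dist <- sapply(seq_len(ncol(seqs)), function(j) {
  counts <- table(factor(seqs[, j], levels = alphabet))
  as.vector(counts) / nrow(seqs)
})
dimnames(state_dist) <- list(alphabet, paste0("t", seq_len(ncol(seqs))))

state_dist
```

Each column sums to 1, and stacking the columns side by side as bars gives the kind of picture described above.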
104. ents: an event sequence object and a minimum support, expressed either as a number of sequences (minSupport) or as a percentage by using the pMinSupport argument.

R> fsubseq <- seqefsub(bf.seqe, minSupport = 100)
R> fsubseq[1:5]
  Subsequence                             Support Count
1 (Parent)                                 0.9860  1972
2 (Parent)-(Parent>Left)                   0.4340   868
3 (Parent>Left)                            0.4340   868
4 (Left+Marr>Left+Marr+Child)              0.2860   572
5 (Parent)-(Left+Marr>Left+Marr+Child)     0.2825   565
Computed on 2000 event sequences

The function seqefsub returns an object of type subseqelist. This object can be printed, plotted and indexed to select specific subsequences. In our example, we printed only the first 5 subsequences. Notice that the subsequences look like event sequences, except that they do not hold time information. Hence, the sequence (Parent)-(Parent>Left) means staying with parents and then leaving home.

10.2.1 Plotting the results

The results of the seqefsub function can be plotted with the plot function. The graphic exhibits the frequency of each selected subsequence. The following example generates the plot shown in Figure 10.1.

R> plot(fsubseq[1:15], col = "cyan")

10.3 Time constraints

The function seqefsub, which searches for frequent subsequences, and several others (see below) accept time constraints through the constraint parameter. This constraint parameter should be the result of the seqeconstraint function, which has the following parameters
105. equences

The function seqecmpgroup identifies the subsequences that significantly discriminate between groups, using the chi-square test. The subsequences are then ordered by decreasing discriminant power. The function takes at least the first two of the following arguments:

subseq: A subseqelist object containing the subsequences considered for discriminating the groups.
group: The variable that defines the groups.
method (optional): By default "chisq"; can be set to "bonferroni" to apply a Bonferroni correction to the p-value.

The function seqecmpgroup returns a specific subseqelist object that can be indexed, printed and plotted like any subsequence list. As an example, we look for the subsequences that best discriminate between birth cohorts. We first compute the list of frequent subsequences and create a cohort factor that distinguishes births before and after 1945. We then look for the most discriminating subsequences.

R> fsubseq <- seqefsub(bf.seqe, pMinSupport = 0.01)
R> cohort <- factor(biofam$birthyr > 1945, labels = c("<=1945", ">1945"))
R> discrcohort <- seqecmpgroup(fsubseq, group = cohort, method = "bonferroni")
R> discrcohort[1:5]
  Subsequence                  Support      p.value statistic index
1 (Parent)-(Parent>Left)         0.434 0.000000e+00 119.52974     2
2 (Parent>Left)                  0.434 0.000000e+00 119.52974     3
3 (Parent)-(Parent>Married)      0.122 2.957312e-08  37.81393    20
106. er means or want the package to be installed in your home directory or another location, answer yes to the question. In this case, other users of your computer will not be able to use the package unless they also install it.

Installing from the CRAN

1. Start R.
2. Install from the R command line with the install.packages() command. A window will open asking you to pick a CRAN mirror site for your session. Once the mirror is selected, a window will open displaying the various packages available from CRAN.
3. Using the mouse, select TraMineR in the list. If you want to install more than one package, hold down the Control key while you click on the additional packages.

Installing from a downloaded tar.gz file

1. Download the tar.gz file containing the package from http://mephisto.unige.ch/pub/traminer/Linux. There are binaries for 32-bit and 64-bit Linux versions. The file name for the 32-bit binary is TraMineR_1[...]_R_i486-pc-linux-gnu.tar.gz and for the 64-bit version it is TraMineR_1[...]_R_x86_64-pc-linux-gnu.tar.gz.
2. Start R.
3. Install from the R command line with

install.packages("path/TraMineR_1[...]_R_i486-pc-linux-gnu.tar.gz", repos = NULL)

where path is the path to the downloaded tar.gz file. The repos = NULL argument must be given for a local install, i.e. one that is not done from a CRAN repository.

B.2 Installing from source package

Though the major part of the package is written in the R language, som
107. erated with the seqdist function by specifying the method = "OM" option, an insertion/deletion cost and a substitution cost matrix. We begin with a simple example to understand OM distances.

R> ex3.seq
  Sequence
1 A-B-C-D
2 A-B-B-D
3 A-B-C-D-D
4 A-B-C-D

We generate a substitution cost matrix with a constant value of 2

R> ccost <- seqsubm(ex3.seq, method = "CONSTANT", cval = 2)
R> ccost
(output: a 4 x 4 matrix over the states A, B, C, D, with 0 on the diagonal and 2 everywhere else)

and compute the distances using this matrix and the default indel cost of 1.

R> ex3.OM <- seqdist(ex3.seq, method = "OM", sm = ccost)
R> ex3.OM
     [,1] [,2] [,3] [,4]
[1,]    0    2    1    0
[2,]    2    0    3    2
[3,]    1    3    0    1
[4,]    0    2    1    0

The generated distance matrix contains the minimal editing costs for transforming the sequences into each other. This matrix is symmetrical, the minimal cost of transforming sequence x into sequence y being the same as that of transforming sequence y into sequence x.

• Since a single substitution of the third state is necessary to transform sequence 1 into sequence 2, and vice versa, the OM distance between them is 2.

R> ex3.OM[1, 2]
[1] 2

• A single insertion or deletion turns sequence 4 into sequence 3, and vice versa; hence the OM distance is 1.

R> ex3.OM[4, 3]
[1] 1

• The cheapest way of turning sequence 2 into sequence 3, or sequence 3 into sequence 2, involves two operations: one
108. es of the actcal.seq sequence object. The resulting plot is shown in Figure 7.6. The labels appearing in the plot's legend are those attached to the object (page 54). Notice that the legend is plotted on the right using the withlegend = "right" option. With the pbarw = TRUE option, the bar widths are set proportional to the sequence frequency.

The frequency plot for the biofam.seq sequence object is obtained with the following commands and shown in Figure 7.7. The most frequent sequence (living with parents without being in a partnership or having children from age 15 to 30) is shared by less than 8% of the cases. This does not mean that living with both parents until age 30 is the typical case. Rather, because the timing of the events of family formation spreads over many years, its variability is high, and the probability of having many individuals with exactly the same calendar (i.e. changing to the same statuses at the same ages) is low.

Sequence frequency table

Instead of the plot, you may want numerical details (counts and percentages) about the most frequent sequences. The seqtab function returns a frequency table of the distinct sequences in the data set. Since the number of distinct sequences can be very high, one can limit the table to the most frequent sequences with the tlim option. The following example shows the frequency table for the actcal.seq sequence object created from actcal, the data set a
109. es that can be extracted from the distinct state sequence and the variance of the consecutive times $t$ spent in the distinct states. For a sequence $x$, the formula is

$$T(x) = \log_2\left(\phi(x)\,\frac{s^2_{t,max}(x) + 1}{s^2_t(x) + 1}\right)$$

where $\phi(x)$ is the number of distinct subsequences of the distinct state sequence, $s^2_t(x)$ is the variance of the state durations for the sequence $x$ and $s^2_{t,max}(x)$ is the maximum value that this variance can take given the total duration of the sequence. This maximum is computed as follows:

$$s^2_{t,max} = (n - 1)(1 - \bar{t})^2$$

where $\bar{t}$ is the mean consecutive time spent in the distinct states, i.e. the sequence duration divided by the number of distinct states in the sequence, and $n$ is the length of the distinct state sequence.

Elzinga's definition of the turbulence is based on the sequence permanence (SPS) data representation of sequences, and the number of distinct subsequences considered is that of the distinct state sequences, i.e. the sequence obtained by considering only one of several same cons

[Figure 8.4: Boxplot of the within sequence entropies by birth cohort (biofam data set). Horizontal axis: birth cohort (1909-18, 1919-28, 1929-38, 1939-48, 1949-58); vertical axis: sequence entropy.]

[Figure 8.5: Boxplot of the within sequence entropies by sex (biofam data set). Horizontal axis: sex (man, woman); vertical axis: sequence entropy.]
110. event
seqecreate    Create event sequence objects
seqefsub      Searching for frequent subsequences
seqeid        Retrieve id of an event sequence object
seqelength    Length of event sequences
seqetm        Creating event transition matrix
seqfind       Find the occurrences of sequence(s) x in the set of sequences y
seqformat     Translation between sequence formats
seqfplot      Graphic presenting the frequency of sequences
seqfpos       Search for the first occurrence of a given element in a sequence
seqgen        Random sequences generation
seqient       Within sequences entropy
seqiplot      Visualization of individual sequences
seqistatd     States frequency for each individual sequence
seqlegend     Plot a legend for the states in a sequence object
seqlength     Sequence length
seqmpos       Number of matching positions between two sequences
seqmtplot     Graphic presenting the mean time spent in each state of the alphabet
seqnum        Translate a sequence object's alphabet into numerical alphabet, ranging 0-(nbstates-1)
seqpm         Find patterns in sequences
seqsep        Adds separators to sequences stored as character string
seqstatd      Sequence of the states frequency tables and entropy of the states distributions
seqstatl      List of distinct states or events (alphabet) for a sequence data set
seqsubm       Create a substitution cost matrix
seqsubsn      Number of distinct subsequences in a sequence
seqtab        Sequences frequency table
seqtrate      Compute transition rates between states
111. events Parent>Married or Parent>Left+Marr. The function returns a logical vector with a TRUE/FALSE answer for each subsequence.

R> condition <- seqecontain(fsubseq, eventList = c("Parent>Married", "Parent>Left+Marr"))
R> fsubseq[condition]
   Subsequence                                               Support Count
6  (Parent)-(Parent>Left+Marr)                                0.2530   506
7  (Parent>Left+Marr)                                         0.2530   506
12 (Parent)-(Parent>Left+Marr)-(Left+Marr>Left+Marr+Child)    0.1495   299
13 (Parent>Left+Marr)-(Left+Marr>Left+Marr+Child)             0.1495   299
20 (Parent)-(Parent>Married)                                  0.1220   244
21 (Parent>Married)                                           0.1220   244
36 (Parent)-(Parent>Left+Marr)-(Left+Marr>Divorced)           0.0105    21
38 (Parent>Left+Marr)-(Left+Marr>Divorced)                    0.0105    21
Computed on 2000 event sequences

To restrict the search to a subset of events, we may add the option exclude = TRUE. In this case, the function returns FALSE for any sequence that contains an event not specified in the eventList argument.

10.5.4 Duration of event sequences

It may be useful to set and retrieve the time span covered by an event sequence. We get the time span of an event sequence with the seqelength function. There are two ways to set the duration. We can define an end of sequence event, in which case the time span is the time until this event occurs. The end of sequence event is specified in seqecreate with the endEvent option. Alternatively, we can set the total sequence duration explicitly with the seqesetlen
112. f categorical time series. Sociological Methods and Research, forthcoming.

Fussell, E. (2005). Measuring the early adult life course in Mexico: An application of the entropy index. In R. Macmillan (Ed.), The Structure of the Life Course: Standardized? Individualized? Differentiated?, Advances in Life Course Research, Vol. 9, pp. 91-122. Amsterdam: Elsevier.

Gauthier, J.-A. (2007). Empirical categorizations of social trajectories: A sequential view on the life course. Thèse, Université de Lausanne, Faculté des sciences sociales et politiques (SSP), Lausanne.

Giele, J. and G. Elder (Eds.) (1998). Methods of Life Course Research: Qualitative and Quantitative Approaches. Thousand Oaks, CA: Sage.

Haubold, B. and T. Wiehe (2006). Introduction to computational biology: An evolutionary approach. Birkhäuser Verlag.

Levenshtein, V. (1966). Binary codes capable of correcting deletions, insertions, and reversals. Soviet Physics Doklady 10, 707-710.

McVicar, D. and M. Anyadike-Danes (2002). Predicting successful and unsuccessful transitions from school to work by using sequence methods. Journal of the Royal Statistical Society: Series A (Statistics in Society) 165(2), 317-334.

Müller, N. S., M. Studer, and G. Ritschard (2007). Classification de parcours de vie à l'aide de l'optimal matching. In XIVe Rencontre de la Société francophone de classification (SFC 2007), Paris, 5-7 septembre 2007, pp. 157-160.

Needleman, S.
113. for analyzing state or event sequences that describe life courses, such as family formation histories or professional careers. Its features also apply to many other kinds of categorical sequence data. It accepts as input many different sequence representations and provides tools for translating sequences from one format to another. It offers several statistical functions for describing and rendering sequences, for computing distances between sequences with different metrics (among which optimal matching, the longest common prefix and the longest common subsequence), and simple functions for extracting the most frequent subsequences and identifying the most discriminating ones among them. A user's guide

License: GPL (>= 2)
URL: http://mephisto.unige.ch/traminer
Packaged: Wed Mar 11 11:40:33 2009; gabuntu
Built: R 2.7.1; i486-pc-linux-gnu; 2009-03-11 11:40:33; unix

Index:

actcal, actcal.tse, alphabet, biofam, cpal, dissassoc, disscenter, dissreg, disstree, disstree2dot, disstreeleaf, dissvar, famform, mvad, plot.subseqelist, plot.subseqelistchisq, read.tda.mdist, seqLLCP, seqLLCS, seqST, seqcomp, seqconc, seqdecomp, seqdef, seqdiff, seqdim, seqdist, seqdplot, seqdss, seqdur, seqeapplysub, seqecmpgroup, seqeconstraint, seqecontain, seqecreate

actcal        Example data set: Activity calendar from the Swiss Household Panel
actcal.tse    Example data set: Activity calendar from the Swiss Household Pane
114. format(actcal[1:100, ], var = 13:24, from = "STS", to = "TSE", tevent = transition)
R> head(actcal.tse)
  id time      event
1  1    0   PartTime
2  2    0 NoActivity
3  2    4      Start
4  2    4   FullTime
5  2   11       Stop
6  3    0   PartTime

Looking at the records for individual 2 (id numbers have been created from the order of the sequences), we see that the events Start and FullTime occur at time 4, and therefore that individual number 2 started a full time job at time 4. This individual then stops working (Stop) at time 11. Note that the times at which the events occur are computed as the number of positions in the sequence before the new resulting state.

Converting from SPELL format

The following command translates the LA data set described above (page 49) to the STS state sequence format. The from option of the seqformat() function is set to "SPELL". However, since the variable containing the states is here a factor with very long labels, we first create a new variable containing numeric codes only. This is done with the as.integer function, which returns the numeric codes associated with each factor level. We then add this new variable to the LA data frame.

R> levels(LA$bvlai00)
[1] "other error"
[2] "filter error"
[3] "inapplicable"
[4] "no answer"
[5] "does not know"
[6] "with both natural parents"
[7] "with one parent and his/her new partner"
[8] "with one parent alone"
[9] "with relatives or in a foster family
115. friends or flat share
with partner (married or not) and friends or flat share

4.2.4 The spell (SPELL) format

In the spell format, there is one line for each spell. Each spell is characterized by its state (assumed constant during the spell) and by the spell start and end times. Hence, STS sequences can easily be constructed from this representation. The following example is an extract of data drawn from the retrospective questionnaire of the Swiss Household Panel about living arrangements. Statuses are described in Table 4.3. The first respondent (id 2713) lived with both natural parents from 1965 to 1989, then with a partner from 1989 to 1990, and again with a partner from 1990 to 1991 and from 1991 to 2002. Here we have multiple consecutive spells for the same status; this is because statuses are aggregated from more detailed ones.

R> SPELL.ex1
   idpers index from until status
1    2713     1 1965  1989      1
2    2713     2 1989  1990      5
3    2713     3 1990  1991      5
4    2713     4 1991  2002      5
5    2714     1 1968  1985      1
6    2714     2 1985  1988      1
7    2714     3 1989  1990      5
8    2714     4 1990  1991      5
9    2714     5 1991  2002      5
10   3713     1 1961  1978      1
11   3713     2 1978  1983      3
12   3713     3 1983  1984      4
13   3713     4 1984  1985      3
14   3713     5 1985  1999      4
15   3713     6 1999  2001    [?]
16  11714     1 1973  1993      1
17  11714     2 1993  2002      5

4.2.5 The person-period format

This format is for instance used for discrete time logistic regressions
116. ft censored.
• Sequences may not be left aligned, depending on the time axis on which they are defined.
• Data may not be available for all measuring points, yielding internal gaps in the sequences.

We consider the famform data set coming with TraMineR, which contains sequences of unequal lengths. Indeed, the sequences contain only the distinct states that the individuals passed through. The sequences are recorded in the compressed format, i.e. as character strings.

R> data(famform)
R> famform
  Sequence
1 S-U
2 S-U-M
3 S-U-M-MC
4 S-U-M-MC-SC
5 U-M-MC

When translating the famform data set into the extended (STS) format, where sequences are stored in a matrix, missing values are generated to pad the rows of the shorter sequences.

R> seqdecomp(famform)
     [,1] [,2] [,3] [,4] [,5]
[1,] "S"  "U"  NA   NA   NA
[2,] "S"  "U"  "M"  NA   NA
[3,] "S"  "U"  "M"  "MC" NA
[4,] "S"  "U"  "M"  "MC" "SC"
[5,] "U"  "M"  "MC" NA   NA

Varying lengths of follow-up / Censored data

Time axis

The discrete time axis on which the sequences are defined can be a calendar time axis or a process time axis (see 4.1.3). A calendar time axis does not have a natural origin, and fixing an origin is simply a convention for providing time points. On a process time axis, the origin represents the date of a starting event. Suppose that we follow up respondents after they experienced some event (an accident, ending education, ...) and information is collected during 10 years. If
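The padding step described above can be reproduced in a few lines of base R. This is only a sketch of the idea, assuming the states are separated by "-" as in the printed sequences; the object names `states` and `sts` are hypothetical and the code is not the implementation of seqdecomp.

```r
# The five famform sequences in compressed form, typed in by hand.
# Assumption: "-" is the separator between states.
famform <- c("S-U", "S-U-M", "S-U-M-MC", "S-U-M-MC-SC", "U-M-MC")

# Split each character string into its successive states.
states <- strsplit(famform, "-", fixed = TRUE)

# Pad every sequence with NA up to the length of the longest one,
# then stack the padded sequences as the rows of a matrix.
max_len <- max(lengths(states))
sts <- t(sapply(states, function(s) {
  c(s, rep(NA_character_, max_len - length(s)))
}))

sts
```

The NA cells appear only at the right end of the rows here; they mark positions that do not exist in the shorter sequences, not states that were observed as missing.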
117. g the sequence, 2) missing values and 3) empty cells used for adjustment when the sequence is shorter than the row length. To illustrate, we load the example data frame ex1. In this example, all elements that are not valid statuses are coded as NA's, the usual way of representing missing values in R. Hence, we do not distinguish between missing values and empty cells.

R> ex1
    1  2  3  4  5  6  7  8  9 10 11 12 13
s1 NA NA NA  A  A  A  A  A  A  A  A  A  A
s2  D  D  D  B  B  B  B  B  B  B NA NA NA
s3 NA  D  D  D  D  D  D  D  D  D  D NA NA
s4  A  A NA NA  B  B  B  B  D  D NA NA NA
s5  A NA  A  A  A  A NA  A  A  A NA NA NA
s6 NA NA NA  C  C  C  C  C  C  C NA NA NA

The sequences are stored in a 13-column matrix, but we state that the real length of each of them is actually 10. The positions at which each sequence starts and ends are summarized in Table 6.1. Some sequences also contain gaps corresponding to unknown states. Now the question is: how are those sequences handled by TraMineR when we create a sequence object and, later on, when computing distances between sequences? And what control do we have over this process? To describe this, we divide the sequence into three distinct parts and define three associated vectors with the indexes of the missing values in each of the three parts.

Table 6.2: Indexes of missing values in the
118. gion00                   com2.00                                   sex
5587 Lake Geneva (VD, VS, GE)  Industrial and tertiary sector communes   woman
     birthy jan00 feb00 mar00 apr00 may00 jun00 jul00 aug00 sep00 oct00 nov00
5587   1963     A     B     B     C     D     D     D     D     C     C     A
     dec00
5587     A

The distribution of the within sequence entropies looks quite different for the biofam data set, as shown in Figure 8.2, obtained with the following commands.

R> biofam.ient <- seqient(biofam.seq)
R> hist(biofam.ient, main = NULL, col = "cyan", xlab = "Entropy")

[Figure 8.2: Within sequence entropies (biofam data set). Histogram; horizontal axis: Entropy.]

We would like to compare the values of the entropies conditioned on the value of a covariate. In order to do this, we first add a column with the sequence entropies to the biofam data frame.

R> biofam <- data.frame(biofam, seqient(biofam.seq))

We can check that the biofam data frame contains one more variable, called Entropy, and summarize the distribution of the Entropy variable.

R> names(biofam)
 [1] "idhous"   "sex"      "birthyr"  "nat_1_02" "plingu02" "p02r01"
 [7] "p02r04"   "cspfaj"   "cspmoj"   "a15"      "a16"      "a17"
[13] "a18"      "a19"      "a20"      "a21"      "a22"      "a23"
[19] "a24"      "a25"      "a26"      "a27"      "a28"      "a29"
[25] "a30"      "Entropy"
R> summary(biofam$Entropy)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
 0.0000  0.2987  0.3333  0.3548  0.4729
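For readers who want to see what a within sequence entropy measures, here is a hedged base R sketch. It assumes the measure is the Shannon entropy of the state distribution inside one sequence, divided by its theoretical maximum (the logarithm of the alphabet size) so that it lies between 0 and 1; whether seqient uses exactly this normalization should be checked in its help page. The function name `within_entropy` and the example sequences are hypothetical.

```r
# Shannon entropy of the distribution of states within a single sequence.
# Assumption: normalization by log(alphabet size), so the result is in [0, 1].
within_entropy <- function(s, alphabet, normalize = TRUE) {
  p <- as.vector(table(factor(s, levels = alphabet))) / length(s)
  p <- p[p > 0]                      # states never visited contribute 0
  h <- -sum(p * log(p))
  if (normalize) h <- h / log(length(alphabet))
  h
}

alphabet <- c("A", "B", "C", "D")

# A sequence that stays in one state carries no uncertainty.
within_entropy(rep("A", 12), alphabet)

# A sequence spending the same time in every state reaches the maximum.
within_entropy(rep(alphabet, each = 3), alphabet)

# Something in between: two states, unevenly visited.
within_entropy(c(rep("D", 5), rep("A", 7)), alphabet)
```

Note that this quantity depends only on how much time is spent in each state, not on the order of the states, which is why the text goes on to composite measures such as turbulence.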
119. gth function.

R> bf.seqe[1:3]
[1] (Parent)-9.00-(Parent>Left+Marr)-1.00-(Left+Marr>Left+Marr+Child)-6.00
[2] (Parent)-1.00-(Parent>Left)-10.00-(Left>Left+Marr)-1.00-(Left+Marr>Left+Marr+Child)-4.00
[3] (Parent)-7.00-(Parent>Left)-5.00-(Left>Left+Marr)-1.00-(Left+Marr>Left+Marr+Child)-3.00
R> seqelength(bf.seqe[1:3])
[1] 16 16 16
R> sl <- numeric()
R> sl[1:2000] <- 14
R> seqesetlength(bf.seqe, sl)
R> seqelength(bf.seqe[1:3])
[1] 14 14 14
R> bf.seqe[1:3]
[1] (Parent)-9.00-(Parent>Left+Marr)-1.00-(Left+Marr>Left+Marr+Child)-4.00
[2] (Parent)-1.00-(Parent>Left)-10.00-(Left>Left+Marr)-1.00-(Left+Marr>Left+Marr+Child)-2.00
[3] (Parent)-7.00-(Parent>Left)-5.00-(Left>Left+Marr)-1.00-(Left+Marr>Left+Marr+Child)-1.00

Appendix A Installing and using R

This appendix gives a short introduction to R. It explains where and how R can be obtained and describes its basic principles and operations. More detailed information can be found on the Comprehensive R Archive Network (CRAN), http://www.r-project.org. You may for instance download the following introduction manual in pdf format: http://cran.r-project.org/doc/manuals/R-intro.pdf. We also strongly recommend the introduction to R by Paradis (2005), available at http://cran.r-project.org/doc/contrib/Paradis-rdebuts_en.pdf.

A.1 Obtaining and ins
120. hin each cluster (mvad data).

2.2 Event sequence analysis

Instead of focusing on sequences of states, we can look at sequences of transitions or events. TraMineR offers specific tools to deal with this kind of data. For dealing with such event sequences, we can:

1. Define the sequences of transitions (see Section 10.5.4).

R> mvad.seqe <- seqecreate(mvad.seq)

2. Look for frequent event subsequences and plot the 15 most frequent ones (Fig. 2.5, see Section 10.2).

R> fsubseq <- seqefsub(mvad.seqe, pMinSupport = 0.05)
R> plot(fsubseq[1:15], col = "cyan")

3. Determine the most discriminating transitions between clusters and plot the frequencies by cluster of the first 6 (Fig. 2.6, see Section 10.4).

R> discr <- seqecmpgroup(fsubseq, group = cl1.3fac)
R> plot(discr[1:6])

[Figure 2.5: A short example. Frequencies of most frequent transitions (mvad data).]

[Figure 2.6: panels Type 1, Type 2]
121. hird character for the union status (0 = not in union, U = in union). The alphabet contains 16 distinct states (see Table 1, page 376 in Aassve et al., 2007).

4.1.2 Single or multichannel

In the previous example, each distinct state is actually a combination of states pertaining to different domains: work status, number of children and union status. The combination of all possible states in each domain yields an alphabet of 16 distinct states. As mentioned by Aassve et al. (2007), the number of possible states available in different time periods implies that the frequency of any specific sequence will be very low. An alternative is to handle the sequences of each domain separately. This is called multichannel sequences.

4.1.3 Time reference: Internal and external clocks

Unlike biological sequences, for instance, trajectories in the social sciences are usually defined on a time axis. The information about time is an important part of sequence data when timing and/or duration is a concern, as in life course analysis. In the case of sequences of states, it is important to know whether the alignment of states is done according to

• an internal time reference, e.g. the age of the individual, such as in the biofam dataset
• or an external time reference, e.g. January to December 2000, such as in the actcal dataset.

One typology of the discrete time axis on which the sequences of st
122. histogram of sequence turbulence (right), mvad data set.

7. Make a typology of the trajectories: load the cluster library, build a Ward hierarchical clustering of the sequences from the optimal matching distances, and retrieve for each individual sequence the cluster membership of the 3-class solution (see Section 9.5). We do not show here the dendrogram produced by plot(clusterward1), which indeed is not a TraMineR feature.

R> library(cluster)
R> clusterward1 <- agnes(dist.om1, diss = TRUE, method = "ward")
R> plot(clusterward1)
R> cl1.3 <- cutree(clusterward1, k = 3)
R> cl1.3fac <- factor(cl1.3, labels = c("Type 1", "Type 2", "Type 3"))

8. Plot the state distribution at each time point within each cluster (Fig. 2.3, see Section 9.5).

R> seqdplot(mvad.seq, group = cl1.3fac)

9. Plot the sequence frequencies within each cluster (Fig. 2.4, see Section 9.5).

R> seqfplot(mvad.seq, group = cl1.3fac, pbarw = T)

[Figure residue: panels Type 1 and Type 2; vertical axes Freq. (n = 265) and Freq. (n = 294); horizontal axis from Sep.93 to Jan.99; legend entries: employment, further education]
123. ineR

4.1 Ontology

Before defining and describing the main formats and representations of sequence data, we begin with an ontology of longitudinal data. This ontology describes the main attributes we can use to identify the various formats and characterize the nature of the sequences the user will have to deal with.

4.1.1 States and events

A first distinction among the data types is whether the basic information they contain consists of states or events. Broadly, in a longitudinal framework, each change of state is an event and each event implies a change of state. However, the state that results from an event may also depend on the previous state, and hence on which other events have already occurred. The states of the biofam data set were for instance derived from the combination of 4 events, as described in Table 3.3, page 22. Conversion between state sequences and event sequences is thus not always straightforward.

Figure 4.1 shows a graphical representation of 10 sequences. Here the sequences are ordered lists of states, with the states being the work status of the corresponding respondent at each time unit, i.e. the months from January to December 2000. Though the sequences are ordered lists of states, they also provide some information about events, especially if we consider events as simple changes of states. In sequence number 1 (first one from the bottom), no event occurred during the observation period since the respo
124. insertion/deletion and one substitution, yielding a cost of 3 (= 1 + 2).

R> ex3.OM[2, 3]
[1] 3

• Since sequences 1 and 4 are identical, no edit is needed and the OM distance is 0.

R> ex3.OM[1, 4]
[1] 0

In the next example, we use the substitution cost matrix previously computed for the biofam.seq sequence object using the seqsubm command.

R> biofam.om <- seqdist(biofam.seq, method = "OM", indel = 3, sm = couts)

The computer needed 0.13 minutes, i.e. 8 seconds, to create the distance matrix of size 2000 x 2000. The size necessary to store the matrix is roughly 30 Mb.

R> object.size(biofam.om)/1024^2
[1] 30.51768

Here is the extract of the distance matrix for the first 10 sequences in the data set. We use the round function to get a more readable output.

R> round(biofam.om[1:10, 1:10], 1)
      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
 [1,]  0.0 21.3 11.6 21.6 15.6 13.9 13.9 15.1  4.0  19.3
 [2,] 21.3  0.0 15.4 17.6 11.7 29.4 29.5 13.3 21.3   7.7
 [3,] 11.6 15.4  0.0 21.7  5.8 17.7 17.9  5.7 11.6  21.4
 [4,] 21.6 17.6 21.7  0.0  5.9 21.4 21.8 11.6 21.6  23.6
 [5,] 15.6 11.7  5.8  5.9  0.0 21.5 21.7  7.6 15.6  17.6
 [6,] 13.9 29.4 17.7 21.4 21.5  0.0 13.9 19.6  9.9  31.4
 [7,] 13.9 29.5 17.9 21.8 21.7 13.9  0.0 19.8 13.9  31.4
 [8,] 15.1 13.3  5.7 11.6  7.6 19.6 19.8  0.0 15.1  21.0
 [9,]  4.0 21.3 11.6 21.6 15.6  9.9 13.9 15.1  0.0  21.5
[10,] 19.3  7.7 21.4 23.6 17.6 31.4 31.4 21.0 21.5   0.0

The result of the object.size function is in bytes
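The small four-sequence example (A-B-C-D, A-B-B-D, A-B-C-D-D, A-B-C-D, with an indel cost of 1 and a constant substitution cost of 2) can be verified with a textbook edit-distance recursion. The sketch below is a plain base R dynamic program written for illustration; `om_distance` and `ex3` are hypothetical names, only a constant substitution cost is supported, and this is not the C code that seqdist actually runs.

```r
# Optimal matching (edit) distance between two state vectors, using a
# constant substitution cost and a single indel cost.
om_distance <- function(x, y, indel = 1, sub_cost = 2) {
  n <- length(x)
  m <- length(y)
  d <- matrix(0, n + 1, m + 1)
  d[, 1] <- (0:n) * indel          # delete all of x
  d[1, ] <- (0:m) * indel          # insert all of y
  for (i in seq_len(n)) {
    for (j in seq_len(m)) {
      s <- if (x[i] == y[j]) 0 else sub_cost
      d[i + 1, j + 1] <- min(
        d[i, j] + s,               # match or substitute
        d[i, j + 1] + indel,       # delete x[i]
        d[i + 1, j] + indel        # insert y[j]
      )
    }
  }
  d[n + 1, m + 1]
}

# The four example sequences, typed in by hand.
ex3 <- list(
  c("A", "B", "C", "D"),
  c("A", "B", "B", "D"),
  c("A", "B", "C", "D", "D"),
  c("A", "B", "C", "D")
)

# All pairwise distances.
ex3_om <- outer(1:4, 1:4, Vectorize(function(i, j) om_distance(ex3[[i]], ex3[[j]])))
ex3_om
```

With these costs a substitution (2) is never cheaper than a deletion followed by an insertion (1 + 1), so the result coincides with what an indel-only distance would give; raising sub_cost above 2 would leave the matrix unchanged, while lowering it would make substitutions preferable.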
125. ix (LLCP). 9.2.1 LCP based metric. The prefix of a sequence of characters (states) is defined as follows. Let $x$ be a sequence of length $n$. The $k$-prefix of $x$ is defined as $x^k = x_1 \ldots x_k$, where $0 \le k \le n$. If $x = abac$, we have $x^3 = aba$ and $x^4 = x = abac$. The length of a sequence $x$ is written $|x|$, and we have $|x| = 4$, $|x^3| = 3$ and $|x^4| = 4$. Hence $x^k$ is a $k$-long prefix of $x$. The empty string $\lambda$, of length $|\lambda| = 0$, is a prefix of any sequence, and thus $x^0 = \lambda$ for any $x$. The LCP based metric uses the length of the longest common prefix of two sequences. Let $P(x, y)$ be the set of all common nonempty prefixes of a pair of sequences $(x, y)$. Since the prefix of any given length is unique, the length $A_P(x, y)$ of the longest common prefix of $x$ and $y$ corresponds to the size $|P(x, y)|$ of this set. The seqLLCP() function returns the value of this measure for a given pair of sequences. Let us take an example with the famform data set. We use the famform.seq sequence object created in the previous section and compute the LLCP for some of the sequences.
R> famform.seq
  Sequence
1 S-U
2 S-U-M
3 S-U-M-MC
4 S-U-M-MC-SC
5 U-M-MC
R> seqLLCP(famform.seq[1, ], famform.seq[2, ])
[1] 2
R> seqLLCP(famform.seq[3, ], famform.seq[4, ])
[1] 4
R> seqLLCP(famform.seq[3, ], famform.seq[5, ])
[1] 0
The longest common prefix of sequences 1 and 2 is S-U, hence the LLCP is 2. It is S-U-M-MC for sequences 3 and 4, yielding an LLCP of 4. The LLCP is 0 for seq
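To make the definition concrete, here is a minimal base R sketch of the length of the longest common prefix. It is an illustration only: `llcp` is a hypothetical helper working on plain character vectors, not the `seqLLCP` function, and the five vectors simply retype the famform sequences printed above.

```r
# Hypothetical helper: length of the longest common prefix of two state vectors.
llcp <- function(x, y) {
  n <- min(length(x), length(y))
  if (n == 0) return(0L)
  same <- x[seq_len(n)] == y[seq_len(n)]
  first_diff <- match(FALSE, same)       # position of the first mismatch, if any
  if (is.na(first_diff)) n else first_diff - 1L
}

s1 <- c("S", "U")
s2 <- c("S", "U", "M")
s3 <- c("S", "U", "M", "MC")
s4 <- c("S", "U", "M", "MC", "SC")
s5 <- c("U", "M", "MC")

llcp(s1, s2)   # common prefix S-U
llcp(s3, s4)   # common prefix S-U-M-MC
llcp(s3, s5)   # no common prefix
```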
126. k2 6.2.4 State labels. State labels are used as legends by the TraMineR plot functions. If not specified, labels are set to the state codes. Use the labels option to define state labels. The labels option expects a vector containing a character string for each state in the alphabet. The order of the labels in the vector must match the order of the states as returned by the seqstatl function (see footnote 3).
R> actcal.seq <- seqdef(actcal, 13:24, labels = c("> 37 hours", "19-36 hours", "1-18 hours", "no work"))
6.2.5 Starting time. The start option specifies the starting time of the sequences. Since the value holds for all the sequences in the data, this information makes sense only when states are dated and when all sequences have the same starting time (i.e. are left-aligned). Otherwise you can safely ignore this option, and the value will be set to 1. This attribute is used, for instance, to create the column names of the sequence object when there are no column names in the input data and no names are provided by the user. This attribute is also updated when selecting subscripts of a sequence object (see Section 6.4). 6.3 Summarizing sequence objects. The generic R summary function displays some information when the name of a sequence object is given as its argument.
R> summary(actcal.seq)
Footnote 3: IMPORTANT: the order of the states returned by the seqstatl function may not be the same on Mac OS X systems as on Linux and Windows system
127. l time stamped event format Get or set the alphabet of a sequence object Example data set Family life states from the Swiss Household Panel retrospective biographical survey Get or set the color palette of a sequence object Analysis of pseudo variance based on dissimilarity measure Compute distance to center of a group Regression analysis of dissimilarity matrix Dissimilarity Tree Graphical representation of a dissimilarity tree Terminal node appartenance Dissimilarity based pseudo variance Example data set sequences of family formation Example data set Transition from school to work Plot frequencies of subsequences Ploting discriminant subsequences Read a distance matrix produced by TDA Compute the length of the longest common prefix of two sequences Compute the length of the longest common subsequence of two sequences Sequences turbulence Compare two sequences Concatenate vectors of states or events into a character string Convert a character string into a vector of states or events Create a sequence object Decompose the diffences between groups of sequences Returns the dimension of a set of sequences Distances between sequences Plot the sequence of state distributions Extract distinct states sequence from a sequence object Extracts states durations from a sequence object Applying Subsequences to Event Sequences Identifying discriminating subsequences Setting time constraint Event sequence contains
128. le Label
idhous     household number
sex        sex of the respondent
birthy     birth year of the respondent
nat_1_02   first nationality of the respondent
plingu02   interview language
p02r01     Confession or religion
p02r04     Participation in religious services: Frequency
cspfaj     Swiss socio-professional category: Father's job
cspmoj     Swiss socio-professional category: Mother's job
a15        family formation status at age 15
a30        family formation status at age 30
Table 3.5: List of Variables in the MVAD data set
id         unique individual identifier
weight     sample weights
male       binary dummy for gender, 1 = male
catholic   binary dummy for community, 1 = Catholic
Belfast, N.Eastern, Southern, S.Eastern, Western   binary dummies for location of school, one of five Education and Library Board areas in Northern Ireland
Grammar    binary dummy indicating type of secondary education, 1 = grammar school
funemp     binary dummy indicating father's employment status at time of survey, 1 = father unemployed
gcsebeq    binary dummy indicating qualifications gained by the end of compulsory education, 1 = 5+ GCSEs at grades A-C, or equivalent
fmpr       binary dummy indicating SOC code of father's current or most recent job, 1 = SOC1 (professional, managerial or related)
livboth    binary dummy indicating living arrangements at time of first sweep of survey (June 1995), 1 =
129. living with both parents
jul93 to jun99   Monthly Activity Variables are coded 1-6: 1 = school, 2 = FE, 3 = employment, 4 = training, 5 = joblessness, 6 = HE
  id weight male catholic Belfast N.Eastern Southern S.Eastern Western Grammar
1  1   0.33   no       no      no        no       no        no     yes      no
  funemp gcsebeq fmpr livboth   Jul.93   Aug.93     Sep.93     Oct.93
1     no      no  yes     yes training training employment employment
      Nov.93     Dec.93   Jan.94   Feb.94     Mar.94     Apr.94     May.94
1 employment employment training training employment employment employment
      Jun.94     Jul.94     Aug.94     Sep.94     Oct.94     Nov.94     Dec.94
1 employment employment employment employment employment employment employment
      Jan.95     Feb.95     Mar.95     Apr.95     May.95     Jun.95     Jul.95
1 employment employment employment employment employment employment employment
      Aug.95     Sep.95     Oct.95     Nov.95     Dec.95     Jan.96     Feb.96
1 employment employment employment employment employment employment employment
      Mar.96     Apr.96     May.96     Jun.96     Jul.96     Aug.96     Sep.96
1 employment employment employment employment employment employment employment
      Oct.96     Nov.96     Dec.96     Jan.97     Feb.97     Mar.97     Apr.97
1 employment employment employment employment employment employment employment
      May.97     Jun.97     Jul.97     Aug.97     Sep.97     Oct.97     Nov.97
1 employment employment employment employment employment employment employment
      Dec.97     Jan.98     Feb.98     Mar.98     Apr.98     May.98     Jun.98
1 employment employment employment employment employment employment employment
      Jul.98     Aug.98     Sep.9
130. m diss TRUE method wi
[Figure 9.1: Hierarchical sequence clustering from the OM distances, Ward method. Dendrogram of biofam.om; y-axis: Height]
The cluster membership for each sequence is then retrieved. A three-cluster solution is chosen here.
R> cluster3 <- cutree(clusterward, k = 3)
R> cluster3 <- factor(cluster3, labels = c("Type 1", "Type 2", "Type 3"))
R> table(cluster3)
cluster3
Type 1 Type 2 Type 3
   472    502   1026
The cluster3 object is a vector containing the cluster id number for each sequence. We use it to produce plots that help identify the typical longitudinal patterns characterizing the clusters. We begin with a frequency plot for each cluster (Fig. 9.2).
R> seqfplot(biofam.seq, group = cluster3, pbarw = T)
[Figure 9.2: sequence frequency plots by cluster (panel shown: Type 1). x-axis: a15 to a30; legend: Parent, Left, Married, Left+Marr, Child, Left+Child, Left+Marr+Child, Divorced]
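The cutree, factor and table steps above can be tried without the biofam data. The sketch below is only an illustration under stated assumptions: it uses `hclust` from the base stats package with the "ward.D2" method as a stand-in for the clustering call of the guide, and the distance matrix comes from fifteen made-up points rather than from OM distances.

```r
# Made-up one-dimensional points forming three well-separated groups;
# dist() plays the role of the sequence distance matrix.
pts <- c(0, 0.5, 1, 1.5, 2, 10, 10.5, 11, 11.5, 12, 20, 20.5, 21, 21.5, 22)
d <- dist(pts)

# Ward clustering with stats::hclust (a stand-in, not the call used in the guide).
tree <- hclust(d, method = "ward.D2")

# Retrieve a three-cluster solution and label the clusters.
cl3 <- cutree(tree, k = 3)
cl3 <- factor(cl3, labels = c("Type 1", "Type 2", "Type 3"))
table(cl3)
```

The resulting factor has one entry per observation, so it can be passed as a grouping variable in the same way as cluster3 above.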
131. mely the one that stores the compressed sequences (see footnote 5).
R> actcal.ext <- seqdecomp(actcal.comp)
R> head(actcal.ext)
[Output: the first six sequences in extended format, one state per column (12 columns). Rows 3 and 6 read B B B B B B B B B B B B and D B B B B B B B B B B B; the other rows are not legible in this copy.]
In the next example, taken from Aassve et al. (2007) and introduced in Section 4.1.1, states are coded with character strings of length 3 and separated with the "-" symbol. Each sequence is transformed into a row vector where each element is a state associated with its duration.
R> seqdecomp(seq.ex1)
     [,1]       [,2]       [,3]      [,4]
[1,] "(000,12)" "(0W0,9)"  "(0WU,5)" "(1WU,2)"
[2,] "(000,12)" "(0W0,14)" "(1WU,2)" NA
To translate compressed sequences with no separator, the sep option can be set to an empty string, as in the following example. In that case, every character in the string is assumed to represent a state or event.
Footnote: In TraMineR, the default separator is "-", but other separators can be specified by the user.
Footnote 5: By default, when no var option is specified, the function assumes that the data set contains only sequence data and hence retains all columns, i.e. here the single column of the actcal.comp object.
R> seqdecomp aa
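The idea behind the conversion from compressed to extended format can be sketched in base R with strsplit. This is an illustration only: `decomp` is a hypothetical helper, not the `seqdecomp` function, and the three compressed sequences are made up. Shorter sequences are padded with NA, as in the second example above.

```r
# Hypothetical helper: split compressed sequences into one state per column.
decomp <- function(x, sep = "-") {
  parts <- strsplit(x, sep, fixed = TRUE)
  n <- max(lengths(parts))
  # Pad each sequence with NA up to the longest length, one row per sequence.
  t(vapply(parts,
           function(p) c(p, rep(NA_character_, n - length(p))),
           character(n)))
}

comp <- c("B-B-B-B", "D-D-A-A", "C-C-B")   # made-up compressed sequences
ext <- decomp(comp)
ext

# With an empty separator, every character is treated as a state.
decomp(c("BBBB", "DDAA"), sep = "")
```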
132. menu in R:
1. Run R if it is not already running.
2. Select "Install package(s) from CRAN" from the Packages menu in R. A window will open asking you to pick a CRAN mirror site for your session. Once the mirror is selected, a window will open displaying the packages available from CRAN.
3. Using the mouse, select the package or packages that you want to install. To install more than one package, hold down the Control key while you click on the additional packages.
4. When you have finished selecting packages, click the OK button.
Installing from a downloaded zip file. Alternatively, you can install binary packages from a previously downloaded zip file:
1. Download the zip file containing the package from http://mephisto.unige.ch/pub/traminer/Windows
2. Run R if it is not already running.
3. Select "Install package(s) from local zip files" from the Packages menu in R.
4. Navigate to the location of the zip file containing the package.
5. Click the Open button.
B.1.2 Linux. To install the TraMineR package into the standard library location under Linux, you need to be the superuser; otherwise you will get a message like this one:
Warning in install.packages("TraMineR") : 'lib = "/usr/local/lib/R/site-library"' is not writable
Would you like to create a personal library R/i486-pc-linux-gnu-library/2.5
If you don't know what superus
133. mes:
1. A vector is a one-dimensional object; its size is just its length. Sequences stored in vectors are typically defined as character strings, each sequence being an element of the vector.
2. A matrix is a two-dimensional object (the two dimensions are rows and columns) containing elements of the same type. Sequences are typically defined as the rows of the matrix, each column giving the state or event at a given time point.
3. A data frame is the most common object for storing sequences. It is like a matrix but can contain objects of different types, for example one or more variables representing sequences (as character strings or vectors of states or events) and covariates. Data sets imported from other statistical packages (see Section 5.1.1) are stored as data frames. The actcal, biofam and mvad data sets are each a data frame object.
5.1.4 Compressed and extended format. In data files, sequences may appear as character strings, which we call the compressed format, or as vectors, which we call the extended format. TraMineR can handle both formats and provides a function to convert between them. For instance, the seqdef and seqformat functions first check whether the data passed to them as argument are in the compressed or the extended format. The extended format. In the extended format, sequences are given as vectors of states or events, where each state or event is stored in a separate column (variable). Each variable usually corresponds
134. minimum required number of sequences to which the subsequence must belong is called the minimum support. It should be set by the user. A subsequence is said to be maximal if it is not included in any other frequent subsequence. In addition to the support requirement, TraMineR also lets you control the search for frequent subsequences with time constraints. For instance, we can specify a window size (the maximal time span during which a subsequence should occur) as well as maximum gaps (the maximum time between two transitions). Minimum and maximum ages can also be specified to study a particular period of the life course, such as the transition to adulthood. 10.1 Creating event sequences. Let us introduce event sequence analysis with a small example. To perform an event sequence analysis, we first create an event sequence object with seqecreate. This function accepts several formats. Internally, TraMineR uses the TSE format (see Section 5.2.2 for more information). Thus the natural way to define an event sequence object is from data in TSE form. The actcal.tse data set contains the information about the activity calendar in this format. In this case, we can use the following code to create an event sequence object:
R> data(actcal.tse)
R> actcal.seqe <- seqecreate(id = actcal.tse$id, timestamp = actcal.tse$time, event = actcal.tse$event)
We can also create an event sequence
135. n the second part of the chapter, you will learn more about the functions offered by TraMineR for converting to and from several longitudinal data formats. Such transformations may prove useful not only for TraMineR but also for applying other statistical methods to your data, such as survival analysis or classification trees. 5.1 Importing data sets into R. Data files generated by statistical programs such as SPSS, SAS and Stata can be imported directly into R by using the foreign library (see footnote 1) and assigned to R objects. We briefly explain hereafter the read.spss command for importing SPSS files and the read.dta command for importing Stata files. Additional details can be found in the R data manual (http://cran.r-project.org/doc/manuals/R-data.pdf), which also covers other file formats. Data in the form of text files or spreadsheets can also be imported easily. 5.1.1 Reading data from other statistical packages. Preliminary remarks. When importing SPSS or Stata files, variables with attached value labels are converted into R factors with levels set to the value labels in the original files. For example, a variable containing states 1, 2, 3, 4 with value labels "single", "living with a partner", "married", "divorced" will be converted into a factor with the four levels single, living with a
Footnote 1: On Ubuntu Linux, and maybe on other Linux distributions, the foreign library is not installed wi
136. nctions for visualizing and describing sequences at the aggregate level. 7.2.1 List of states present in sequence data. A first result we may want is just the list of states present in the data set. This is obtained with the alphabet function when the list of states has not been explicitly specified by the user (see footnote 1). The alphabet function returns the list of the possible states for a sequence object. In the following example, we see that the alphabet for the actcal.seq sequence object contains 4 distinct states, A, B, C and D (see Table 3.1, page 21, for their description).
R> data(actcal)
R> actcal.lab <- c("> 37 hours", "19-36 hours", "1-18 hours", "no work")
R> actcal.seq <- seqdef(actcal, 13:24, labels = actcal.lab)
R> alphabet(actcal.seq)
[1] "A" "B" "C" "D"
To get the list of all distinct states appearing in a data set containing sequences that have not been converted into a sequence object, use the seqstatl function. You tell seqstatl which variables define the sequence data by providing, with the var argument, either their names or their column index numbers. To specify the columns by their names, group them into a vector with the c() function. By default, the seqstatl function expects an STS formatted data set as input. If the sequ
Footnote 1: In that case, some states in the alphabet may not appear in the data. See Sec. 6 for more information on this topic.
137. nd remember the directory name where the file is saved.
3. You must have superuser (root) access to install the package in the standard library location. Open a terminal and type (you will be asked for the superuser password):
sudo R CMD INSTALL path TraMineR 1 tar gz
Appendix C. Information about TraMineR content. Below we show the content of the information window obtained with library(help = TraMineR). This information indicates, among other things, the version number of the installed TraMineR package and the list of available functions and data sets. Since later versions of TraMineR will most probably offer new features, we strongly recommend that you check the updated information window on your system after installing a new version.
Description:
Package:      TraMineR
Version:      1.2
Date:         2009-03-11
Title:        Sequences and trajectories mining for social scientists
Author:       Alexis Gabadinho <alexis.gabadinho@unige.ch>, Matthias Studer <matthias.studer@unige.ch>, Nicolas S. Muller <nicolas.muller@unige.ch>, Gilbert Ritschard <gilbert.ritschard@unige.ch>
Maintainer:   Alexis Gabadinho <alexis.gabadinho@unige.ch>
Depends, Suggests:   R (>= 2.7.1), RColorBrewer, boot, cluster
Description:  This package is a toolbox for sequence manipulation, description, rendering and more generally sequence data mining in the field of social sciences. Though it is primarily intended
138. ndent stays in the same state during the whole sequence. In sequence 2 (second from bottom), two events occurred:
• The respondent changed his work status between time unit 4 (April 2000) and time unit 5 (May 2000), from no work to full time paid work.
• Then the respondent changed his work status again between time unit 11 (November 2000) and time unit 12 (December 2000), from full time paid work to no work.
[Figure 4.1: First 10 sequences of the actcal data (first at bottom). x-axis: jan00 to nov00; legend: > 37 hours, 19-36 hours, 1-18 hours, no work]
States or events can be coded with letters, character strings or digits. The alphabet is the list of all possible states or events appearing in the data. In the following example, taken from Aassve et al. (2007), states are coded with character strings of length 3 and separated by the "-" character. We will see other formats to represent such sequences in the following sections.
R> seq.ex1[, 10:25]
  Sequence
1 000-000-000-0W0-0W0-0W0-0W0-0W0-0W0-0W0-0W0-0W0-0WU-0WU-0WU-0WU
2 000-000-000-0W0-0W0-0W0-0W0-0W0-0W0-0W0-0W0-0W0-0W0-0W0-0W0-0W0
For each state in the sequence, the first character stands for the number of children (0 = no children, 1 = 1 child, etc.), the second character for the work status (0 = not working, W = working), and the t
139. ng the monthly activity statuses from January to December 2000.
R> actcal[1, 13:24]
     jan00 feb00 mar00 apr00 may00 jun00 jul00 aug00 sep00 oct00 nov00 dec00
2848     B     B     B     B     B     B     B     B     B     B     B     B
[Figure 4.2: Ontology of types of longitudinal data. Node labels: longitudinal data; one state per time unit t; several states at each t; spell duration]
Table 4.1: Sequence data representations
Code    Data type                 States (S) or   One (1) or several (M)     Import into a
                                  events (E)      rows for each individual   sequence object
STS     State-sequence            S               1                          Yes
SPS     State-permanence          S               1                          Yes
DSS     Distinct-State-Sequence   S               1                          Yes (use STS)
TSE     Time-stamped-event        E               M                          Yes (event sequence)
SPELL   Spell                     S               M                          Yes
        Person-period                             M
Table 4.2: Sequence data representations, Examples
Code     Example
STS      Id   18  19  20  21  22  23  24  25  26  27
         101  S   S   S   M   M   MC  MC  MC  MC  D
         102  S   S   S   MC  MC  MC  MC  MC  MC  MC
SPS(1)   Id   State 1  State 2  State 3  State 4  State 5
         101  (S,3)    (M,2)    (MC,4)   (D,1)
         102  (S,3)    (MC,7)
SPS(2)   Id   State 1  State 2  State 3  State 4  State 5
         101  S/3      M/2      MC/4     D/1
         102  S/3      MC/7
DSS      Id   State 1  State 2  State 3  State 4  State 5
         101  S        M        MC       D
         102  S        MC
TSE      id   time  event
         101  21    Marriage
         101  23    Child
         101  27    Divorce
         102  21    Marriage
         102  21    Child
SPELL    id   index  from  to  status
         101  1      18    20  Single
         101  2      21    22  Married
         101  3      23    26  Married w Children
         101
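The relation between the STS, SPS and DSS rows of Table 4.2 can be checked with a short base R sketch built on rle (run-length encoding). This is an illustration only, not the TraMineR conversion functions; the vector retypes the STS row of individual 101, and the "(state,duration)" delimiters are an assumption about how the SPS string is written.

```r
# STS representation of individual 101 (ages 18 to 27), retyped from Table 4.2.
sts <- c("S", "S", "S", "M", "M", "MC", "MC", "MC", "MC", "D")

# Run-length encoding gives the distinct successive states and their durations.
runs <- rle(sts)

dss <- runs$values                                  # distinct-state sequence
sps <- paste0("(", runs$values, ",", runs$lengths, ")", collapse = "-")

dss
sps

# Going back from spells to one state per time unit.
back <- rep(runs$values, times = runs$lengths)
identical(back, sts)
```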
140. o compare two sequence data sets and that some states in one data set are not present in the other one. Unless the list of possible states is explicitly specified with the alphabet option when creating the sequence objects from these data sets, the missing states will not be accounted for, which may produce misleading results when comparing tabulations of the state frequencies of the two data sets. The colors attributed to the states will also differ between the data sets, which may be a further source of confusion. Let us take a short example to illustrate this point. We create two sequence objects, one with the first three sequences of the actcal data set
R> actcal.s1 <- seqdef(actcal[1:3, 13:24])
R> alphabet(actcal.s1)
[1] "A" "B" "D"
and one with sequences 7 to 9.
R> actcal.s2 <- seqdef(actcal[7:9, 13:24])
R> alphabet(actcal.s2)
[1] "A" "D"
In the first object, the alphabet is set to A, B, D, while in the second object it is set to A, D. Since we know that the possible states are A, B, C, D, we manually specify the alphabet for the first
R> actcal.s1 <- seqdef(actcal[1:3, 13:24], alphabet = c("A", "B", "C", "D"))
R> alphabet(actcal.s1)
[1] "A" "B" "C" "D"
and the second object
R> actcal.s2 <- seqdef(actcal[7:9, 13:24], alphabet = c("A", "B", "C", "D"))
R> alphabet(actcal.s2)
[1] "A" "B" "C" "D"
which makes it possible to directly compare plots and tabulations of each sequence obj
141. ongest common subsequence is indeed U-S. Now we compute the length of the longest common subsequence of the pair (x, z).
R> seqLLCS(x, z)
[1] 4
The longest common subsequence of (x, z) is S-U-M-S. It appears in x as $x_1 x_2 x_4 x_5$ and in z as $z_1 z_2 z_3 z_4$. Now, if we define the attribute $A_S(x, y) = \max\{|u| : u \in S(x, y)\}$, where $S(x, y)$ is the set of common subsequences of the pair of sequences $(x, y)$, so that $A_S(x, y)$ is the length of their longest common subsequence, an LCS distance can be defined as $d_{LCS}(x, y) = |x| + |y| - 2 A_S(x, y)$, and a similarity can be defined from $A_S(x, y)$ as well (its formula is not legible in this copy). Since $\lambda$, the empty string, is a subsequence of any sequence, we have $|S(x, y)| \ge 1$. 9.3.2 Computing LCS distances. LCS based distances can be computed with the seqdist function using the method = "LCS" option. In the following example, the results for the three sequences are those shown in the lower triangle of Table 7 in Elzinga (2008).
R> seqdist(LCS.ex, method = "LCS")
     [,1] [,2] [,3]
[1,]    0    6    5
[2,]    6    0    3
[3,]    5    3    0
In the next example, we compute the LCS distances for the biofam.seq sequence object previously created from the biofam data frame
R> biofam.lcs <- seqdist(biofam.seq, method = "LCS")
and print the distance matrix for the first 10 sequences.
R> biofam.lcs[1:10, 1:10]
      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
 [1,]    0   20   10   22   16   14   14   14    4    20
 [2,]   20    0   12   10    8   30   30   14   22     6
 [3,]   10   12    0   12    6   18   18    4   12    16
 [4,]   22
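The LCS distance, the two sequence lengths minus twice the length of the longest common subsequence, can be illustrated with a small dynamic-programming sketch in base R. This is a minimal sketch, not the TraMineR implementation: `llcs` and `lcs_dist` are hypothetical helpers, and the two sequences are a standard textbook pair rather than sequences from the guide.

```r
# Hypothetical helper: length of the longest common subsequence of two vectors.
llcs <- function(x, y) {
  n <- length(x)
  m <- length(y)
  L <- matrix(0L, n + 1, m + 1)
  for (i in seq_len(n)) {
    for (j in seq_len(m)) {
      L[i + 1, j + 1] <- if (x[i] == y[j]) {
        L[i, j] + 1L                       # extend a common subsequence
      } else {
        max(L[i, j + 1], L[i + 1, j])      # drop one element of x or of y
      }
    }
  }
  L[n + 1, m + 1]
}

# LCS distance: |x| + |y| - 2 * (length of the longest common subsequence).
lcs_dist <- function(x, y) length(x) + length(y) - 2 * llcs(x, y)

x <- strsplit("ABCBDAB", "")[[1]]   # made-up sequences
y <- strsplit("BDCABA", "")[[1]]
llcs(x, y)
lcs_dist(x, y)
```

The distance is 0 only when both sequences are identical, since then the longest common subsequence is the sequence itself.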
142. ons 92
9.2 Longest Common Prefix (LCP) distances 93
9.2.1 LCP based metric 93
9.2.2 Computing LCP distances 94
9.3 Longest Common Subsequence (LCS) distances 94
9.3.1 LCS based metric 95
9.3.2 Computing LCS distances 96
9.3.3 LCS distances with internal gaps 96
9.4 Optimal matching (OM) distances 97
9.4.1 The insertion/deletion cost 97
9.4.2 The substitution cost matrix 97
9.4.3 Generating optimal matching distances 98
9.4.4 LCS distance as a special case of OM distance 100
9.4.5 Optimal matching with internal gaps 100
9.5 Clustering distance matrices 102
10 Analysing event sequences 105
10.1 Creating event sequences 105
10.2 Searching for frequent event subsequences 107
10.2.1 Plotting the results 107
10.3 Time constraints 107
10.4 Identifying discriminant event subsequences 109
10.4.1 Plotting the results 110
10.5 More advanced topics and utilities
143. orm
R> seqconc(actcal.SPS[1:6, ])
  Sequence
1 (B,12)
2 (D,4)-(A,7)-(D,1)
3 (B,12)
4 (C,9)-(B,3)
5 (A,12)
6 (D,1)-(B,11)
Converting to TSE format. To extract time-stamped events (the TSE format, which is the internal format used by TraMineR) from a sequence of statuses, a matrix of size ns x ns must be given, where ns is the number of distinct states appearing in the sequences. In this matrix, the cell (a, b), where a is the row index and b the column index, contains a comma-separated list of all events associated with a transition from state a to state b. The diagonal of this matrix has a special meaning: it defines the initial event of the sequence. For example, position (a, a) gives the event generated when the sequence starts with state a. The exact design of this matrix can be tricky, since a transition may imply several events and the same event may appear in several transitions. However, TraMineR implements several basic generic methods to build this matrix with the function seqetm. You can then adapt the generated matrix to your needs by editing the appropriate cells. However, if you create your own matrix from scratch, be aware that the row and column names of the matrix must be in a one-to-one mapping with the states appearing in the data set, since they are used to retrieve the events associated with transitions from one state to the other. The first generic method
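How such a matrix drives the conversion can be sketched in base R. This is only an illustration under stated assumptions: `to_tse` is a hypothetical helper, not the TraMineR conversion; the event labels ("D>A", etc.) are made up; and the sketch stamps each event with the 1-based position at which it occurs, which is an arbitrary convention chosen here.

```r
states <- c("A", "B", "D")

# Event matrix: cell (a, b) holds the event(s) for a transition from a to b,
# the diagonal holds the initial event when a sequence starts in that state.
tm <- outer(states, states, paste, sep = ">")
dimnames(tm) <- list(states, states)
diag(tm) <- states

# Hypothetical helper: time-stamped events of one state sequence.
to_tse <- function(s, tm) {
  chg <- which(s[-1] != s[-length(s)]) + 1   # positions where the state changes
  pos <- c(1, chg)                           # position 1 carries the initial event
  from <- c(s[1], s[chg - 1])
  to <- s[pos]
  ev <- strsplit(tm[cbind(from, to)], ",", fixed = TRUE)  # comma-separated events
  data.frame(time = rep(pos, lengths(ev)), event = unlist(ev),
             stringsAsFactors = FALSE)
}

s <- c("D", "D", "D", "D", "A", "A", "A", "A", "A", "A", "A", "D")
tse <- to_tse(s, tm)
tse
```

The lookup `tm[cbind(from, to)]` indexes the matrix by row and column names, which is why the names must match the states one-to-one.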
144. qdplot for plotting the state distribution at each time point, seqfplot for plotting the frequencies of the most frequent sequences, and seqiplot for plotting all or a selection of individual sequences. 7.1.1 Color palette representing the states. The plot functions mentioned above all use a specific color for each state. The colors are chosen by selecting a color palette. To facilitate readability, it is important to use the same color palette for all plots based on the same alphabet. The philosophy retained in TraMineR is therefore to attach the alphabet and the color palette as attributes of the sequence object (see Section 6.2) and to let the plotting functions retrieve these attributes when generating the plots. The same is true for the labels of the time axis ticks and the labels of the states. 7.1.2 Plotting the legend separately. To be understandable, a plot must be accompanied by the legend of the state colors used. By default, each sequence plot therefore produces the legend at the top of the graphic, using the attributes of the plotted sequence object. In some cases, especially when you generate multiple plots (for instance a state distribution plot and a sequence frequency plot), it may be preferable to generate plots without legends and to produce the legend only once, separately. For doing so, TraMineR provides the seqlegend function, which generates the legend as a separate graphic, and a wi
145. raMineR content 125
Bibliography 129
List of Tables
3.1 State definition for the activity calendar (actcal) data set 21
3.2 Covariates and state variables of the activity calendar (actcal) data set 22
3.3 State definition for the biofam data set 22
3.4 List of Variables in the biofam data set 23
3.5 List of Variables in the MVAD data set 23
3.6 Performance and memory usage 25
4.1 Sequence data representations 30
4.2 Sequence data representations, Examples 30
4.3 Living arrangements (SHP) 32
5.1 Considered events of the activity calendar (actcal) data set 42
5.2 Events associated to each state transition 42
5.3 Structure for the spell format 44
6.1 Start and end of the sequences in the ex1 data set 58
6.2 Indexes of missing values in the three parts of the sequences 59
List of Figures (2.1 to 2.6, 4.1, 4.2, 7.1 to 7.7, 10.1)
2.1 A short example: Plot of 10 first sequences (top left), plot of 10 most frequent sequences (top right) and state distribution plot (bottom left), mvad data set 18
2.2 A short example: Entropy of the state distribution (left) and histogram of sequence turbulence (right)
146. rics for evaluating distances between sequences, aggregated and index plots of sets of sequences.
• Specific TraMineR functions can be combined in the same script with any of the numerous basic statistical procedures of R, as well as with those of any other R package.
Before describing the usage of the TraMineR package for R, a few remarks are in order on the nature of the sequence data considered in the particular field of social sciences. In the social sciences, sequence data typically represent longitudinal biographical data such as employment histories or family life courses. Following for instance Brzinsky-Fay et al. (2006), we may simply define a sequence as an ordered list of states (employed, unemployed) or events (leaving the parental home, marriage, having a child). For now, let us just retain that there are multiple other ways of representing longitudinal data, which will be discussed in more detail in Chapter 4, and that TraMineR will prove useful for converting from one form to the other.
Footnote 2: Minor changes may be needed in case of references to file names and paths, or other interactions with the OS.
Chapter 2. A short example to begin with. Nothing is better than an example to present the features of TraMineR. We will use for this purpose an example data set from McVicar and Anyadike-Danes (2002), which has been included with the package (see Section 3.2). The data stems from a survey on the transition from school to work and contains 72 monthly activit
147. s
[>] dimensionality of the sequence space: 36
[>] 2000 sequences in the data set
[>] 186 unique sequences in the data set
[>] min/max sequence length: 12/12
[>] alphabet: 1=A 2=B 3=C 4=D
[>] colors: 1=#7FC97F 2=#BEAED4 3=#FDC086 4=#FFFF99
[>] labels: 1=> 37 hours 2=19-36 hours 3=1-18 hours 4=no work
[>] code for missing statuses:
The dimensionality is the number of dimensions necessary for constructing the sequence space (Haubold and Wiehe 2006), i.e. $d = (|A| - 1)\,\ell$, where $|A|$ is the size of the alphabet and $\ell$ the maximal length of the sequences. Here, $d = (4 - 1) \times 12 = 36$. 6.4 Indexing and printing sequence objects. Displaying a sequence object is as simple as typing its name. However, displaying a sequence object containing 2000 rows, such as actcal.seq, is not very interesting. Subscripts can be used to display only selected rows and/or columns of the data. Subscripts and indexes work the same way as for matrices and data frames. In the next example, we display only the first 5 sequences and columns 3 to 8 (March to August) of the previously created actcal.seq sequence object. Typing a sequence object name, with or without subscripts, is equivalent to issuing the print command with the object name as argument.
R> actcal.seq[1:5, 3:8]
  Sequence
1 B-B-B-B-B-B
2 D-D-A-A-A-A
3 B-B-B-B-B-B
4 C-C-C-C-C-C
5 A-A-A-A-A-A
Note that the sequences are
148. st translates the sequence data in this format when using the seqconc function with the TRUE option. In the following example, we search for the pattern DAAD (see Table 3.1, page 21, for the meaning of the states) in the activity calendar sequence data object.
R> seqpm(actcal.seq, "DAAD")
$MTab
  pattern nbocc
1    DAAD     4
$MIndex
[1]  964  967 1197 1797
Four sequences contain the pattern. If we want to look at the sequences containing the DAAD subsequence, we use the MIndex element of the list returned by the seqpm function. We first store the result of the function in an object named daad and then access the sequences containing the pattern using daad$MIndex as row index for the actcal.seq sequence object (since we want all the columns, we leave the column index empty).
R> daad <- seqpm(actcal.seq, "DAAD")
R> actcal.seq[daad$MIndex, ]
     Sequence
964  D-A-A-D-D-D-D-D-D-A-A-A
967  D-D-A-A-D-D-A-A-A-A-A-A
1197 D-D-B-B-C-D-D-A-A-D-C-C
1797 D-D-D-D-A-A-D-A-B-B-D-D
Chapter 8. Sequence characteristics and associated measures. This chapter focuses on the characterization of individual sequence properties and their summary. 8.1 Basic sequence characteristics. 8.1.1 Sequence length. The seqlength function returns the length of the sequences in a sequence object.
[R commands and output not legible in this copy]
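The pattern search described above amounts to concatenating each sequence into a string and looking for the pattern as a substring. The base R sketch below illustrates the idea; it is not the `seqpm` function, it assumes single-character state codes, and the three sequences are made up.

```r
# Made-up state sequences, one per row, with single-character state codes.
seqs <- rbind(c("D", "A", "A", "D", "D", "D"),
              c("B", "B", "B", "B", "B", "B"),
              c("D", "D", "A", "A", "D", "A"))

# Concatenate each row into one string, then search for the pattern literally.
strings <- apply(seqs, 1, paste, collapse = "")
hits <- grepl("DAAD", strings, fixed = TRUE)

res <- list(nbocc = sum(hits), index = which(hits))
res

# Rows containing the pattern, all columns kept.
seqs[res$index, ]
```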
149. …sta00" "nbadul00" "nbkid00"
 [7] "aoldki00" "ayouki00" "region00" "com2.00"  "sex"      "birthy"
[13] "jan00"    "feb00"    "mar00"    "apr00"    "may00"    "jun00"
[19] "jul00"    "aug00"    "sep00"    "oct00"    "nov00"    "dec00"

R> actcal[1, ]
     idhous00 age00 educat00 civsta00 nbadul00 nbkid00 aoldki00 ayouki00
2848    60671    47 maturity  married        3       2       NA       14
     region00                          com2.00
2848 Middleland (BE, FR, SO, NE, JU)   Industrial and tertiary sector communes
     sex   birthy jan00 feb00 mar00 apr00 may00 jun00 jul00 aug00 sep00 oct00
2848 woman   1953     B     B     B     B     B     B     B     B     B     B
     nov00 dec00
2848     B     B

This data set contains a sample of 2000 records of individual monthly activity statuses from January to December 2000, with the activity statuses coded as described in Table 3.1. In addition, it also contains (in the first 12 columns) some covariates gathered at the individual and household level. The variables in the data set are listed in Table 3.2. Sequences are in the columns named jan00, feb00, etc. The row labels are just id numbers. Notice that the numbering is not consecutive. This is because cases were randomly selected. Each row contains a sequence of states, i.e. activity statuses reported by a respondent to the wave of year 2000 of the SHP survey. The respondent whose activity calendar is in row 1 stayed in a part-time (19-36 hours per week) paid job during the whole period. The respondent in row 2 (labeled 1230) had no job between January and April 2000, then worked full time bet
150. substitution cost matrix generated with one of the above two methods. With the method = "CONSTANT" option, you provide the constant as the cval argument, while this argument is ignored with the method = "TRATE" option. An example with a constant substitution cost is given on page 100. In the example below, the substitution cost matrix is generated using the transition rates in the data.

R> couts <- seqsubm(biofam.seq, method = "TRATE")
R> round(couts, 2)
[output: 8 x 8 substitution cost matrix, rows and columns labeled 0-> to 7->]

The alphabet is composed of 8 distinct states, so the substitution cost matrix has dimension 8 x 8. We can check with the range() function that the minimum cost is 0, for a substitution of one state by itself, and the maximum is 2, meaning that the transition never occurs in the data set.

R> range(couts)
[1] 0 2

9.4.3 Generating optimal matching distances

Optimal matching distances are gen
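The bounds 0 and 2 mentioned above follow from the way transition-rate costs are built. The base R sketch below assumes the usual rule, cost(i, j) = 2 - p(j | i) - p(i | j) with 0 on the diagonal, and applies it to made-up toy sequences; it is an illustration of the principle, not a reimplementation of seqsubm.

```r
# Toy state sequences (rows = individuals, columns = time points); made-up data
seqs <- rbind(
  c("A", "A", "B", "B"),
  c("A", "B", "B", "C"),
  c("C", "C", "C", "C")
)
states <- sort(unique(as.vector(seqs)))

# Pair each state at time t with the state at time t + 1
from <- factor(as.vector(seqs[, -ncol(seqs)]), levels = states)
to   <- factor(as.vector(seqs[, -1]), levels = states)
counts <- table(from, to)

# Transition rates: each row sums to 1
trate <- prop.table(counts, margin = 1)

# Assumed cost rule: 2 - p(i -> j) - p(j -> i), and 0 for a state with itself
cost <- unclass(2 - trate - t(trate))
diag(cost) <- 0
print(round(cost, 2))
print(range(cost))
```

Here A and C never follow each other, so their substitution cost reaches the maximum of 2.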
151. …t, or, since we want here the 3rd component, list.ex[[3]].

A.3.4 Accessing and extracting data

Row and column names. Data frames and matrices have row and column names; lists have element names. The rownames and colnames functions can be used to access, modify or print these labels. The column names correspond to what are known as variable names in other statistical packages like Stata, SPSS or SAS.

> colnames(iris)
[1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width"  "Species"

Row names are names assigned to the rows of the data object.

> rownames(iris)
[1] "1" "2" "3" ... "150"

Default row names are the row numbers, as illustrated above for the iris data set. Any character string can be used as row name. With the paste command, which concatenates its arguments into a character string, we may for instance create a vector of 150 names composed of the French string "fleur d'iris n" and a number from 1 to 150, and assign this vector as row names.

> row.names(iris) <- paste("fleur d'iris n", 1:150)
> iris
[output: the iris data frame, with columns Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, printed with the new row names]
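The row-name assignment described above can be tried safely on a copy of the built-in iris data set. The object name `flowers` is introduced here only for illustration.

```r
# Work on a copy so that the built-in iris data set is left untouched
flowers <- iris

# paste() recycles the string and joins it with each number, separated by a space
new_names <- paste("fleur d'iris n", 1:150)
rownames(flowers) <- new_names

print(colnames(flowers))
print(head(rownames(flowers), 3))
```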
152. …talling R

R is a free integrated suite of software facilities for data manipulation, calculation and graphical display. It is available in precompiled binary form for Linux, MacOS X and Windows, and more generally in source form that can be compiled under many other operating systems. You can download R from the CRAN, http://cran.r-project.org (select a mirror close to you), where you will also find installation instructions.

A.2 R basics

Starting R. Although there exist menu-driven graphical user interfaces for R, R is originally a command-line environment. When starting R, you get a command-line prompt (shown in an R console under Windows) at which you can enter commands. If you are using Linux, just launch a terminal and enter R at the command prompt. Figure A.1 shows the screen display and command prompt as it appears after launching R in a Linux console. Here the greeting message is in French because the authors of this manual run a French version of R. To quit R, enter the command q(). You will be prompted for saving your workspace. Answer y if you want to save all your data and objects. Your workspace will then be restored the next time you use R.

Writing and saving R program files. The best way of using R is to write command files. R command files usually have a .R extension. You can add comments in your program files. Starting with a double hash mark (##), everything to the end of the line is a comment (a single # is sufficient in R). Under Ma
153. …tatd(mvad.seq)
R> plot(mvad.sd$Entropy, xlab = "Month", ylab = "Entropy", type = "l", lwd = 3.5, col = "blue", axes = F, cex = 1.3, frame.plot = T, ylim = c(0, 1))
R> axis(1, labels = names(mvad.seq), at = 1:72)
R> axis(2, at = c(0, 0.2, 0.4, 0.6, 0.8, 1))

[Figure 7.5: Entropy of state distribution by age, actcal data set. x axis: Month (Jul.93 to Jan.99); y axis: Entropy (0 to 1).]

In the previous commands, many graphical options are provided to make the plot look nicer, e.g. to customize the axis labels and change the line color and width. For a list of available graphical options, type ?plot and ?par. However, if you are frightened by all those options, you can obtain your first, less sophisticated plot just by typing

R> plot(sd$Entropy, type = "b")

7.2.3 Sequence frequencies

Sequence frequency plot. The seqfplot function plots the most frequent sequences. Each sequence is plotted as a horizontal bar split in as many colorized cells as there are states in the sequence. The sequences are ordered by decreasing frequency from bottom up. By default, the 10 most frequent sequences are plotted. However, you can select the number of most frequent sequences to plot with the tlim option. The next command plots for instance the 10 most frequent sequenc
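For readers who wonder what the plotted entropy measures, the following base R sketch computes the Shannon entropy of the state distribution at each time point. It assumes the entropy is divided by the logarithm of the alphabet size so that it lies between 0 and 1, which is consistent with the ylim = c(0, 1) used above; the data and the helper `state_entropy` are made up for illustration.

```r
# Toy state sequences: rows = individuals, columns = time points (made-up data)
seqs <- rbind(
  c("A", "A", "B"),
  c("A", "B", "C"),
  c("A", "B", "A"),
  c("A", "A", "C")
)
alphabet <- c("A", "B", "C")

# Shannon entropy of the state distribution in one column,
# divided by log(alphabet size) so that it lies between 0 and 1
state_entropy <- function(column, alphabet) {
  p <- table(factor(column, levels = alphabet)) / length(column)
  p <- p[p > 0]                      # a state with zero frequency contributes 0
  -sum(p * log(p)) / log(length(alphabet))
}

entropy <- apply(seqs, 2, state_entropy, alphabet = alphabet)
print(round(entropy, 3))
```

The first column, where everybody is in state A, has entropy 0; the more evenly the states are spread at a time point, the closer the value gets to 1.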
154. …terested in specific events.

     A                       B                        C                        D
A    FullTime                endFullTime,PartTime     endFullTime,LowPartTime  endFullTime,NoWork
B    endPartTime,FullTime    PartTime                 endPartTime,LowPartTime  endPartTime,NoWork
C    endLowPartTime,FullTime endLowPartTime,PartTime  LowPartTime              endLowPartTime,NoWork
D    endNoWork,FullTime      endNoWork,PartTime       endNoWork,LowPartTime    NoWork

For instance, we may be interested in the following events in the activity calendar (actcal) data set.

Table 5.1: Considered events of the activity calendar (actcal) data set

Code          Status
Increase      Increasing activity rate
Decrease      Decreasing activity rate
Start         Starting an activity
Stop          Stopping an activity
FullTime      Starting a full-time paid job (37 hours or more per week)
PartTime      Starting a part-time paid job (19-36 hours per week)
LowPartTime   Starting a part-time paid job (1-18 hours per week)
NoActivity    Starting a period without activity

We may thus define the following matrix.

Table 5.2: Events associated to each state transition

From state   To state
             Full time (A)       Part time (B)       Low part time (C)       No work (D)
A            FullTime            Decrease, PartTime  Decrease, LowPartTime   Stop
B            Increase, FullTime  PartTime            Decrease, LowPartTime   Stop
C            Increase, FullTime  Increase, PartTime  LowPartTime             Stop
D            Start, FullTime     Start, PartTime     Start, LowPartTime      NoActivity

Remember that the events given on the diagonal of
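The event matrix of Table 5.2 is an ordinary character matrix, so it can be sketched in base R as below. Writing simultaneous events as one comma-separated string per cell is an assumption made for this illustration; in the manual the matrix skeleton is obtained from TraMineR before the cells are filled in.

```r
states <- c("A", "B", "C", "D")

# Empty 4 x 4 character matrix: rows = state left, columns = state entered
transition <- matrix("", nrow = 4, ncol = 4,
                     dimnames = list(from = states, to = states))

# One row per origin state, following Table 5.2
transition["A", ] <- c("FullTime", "Decrease,PartTime", "Decrease,LowPartTime", "Stop")
transition["B", ] <- c("Increase,FullTime", "PartTime", "Decrease,LowPartTime", "Stop")
transition["C", ] <- c("Increase,FullTime", "Increase,PartTime", "LowPartTime", "Stop")
transition["D", ] <- c("Start,FullTime", "Start,PartTime", "Start,LowPartTime", "NoActivity")

print(transition)

# The diagonal holds the event used when a sequence starts in that state
print(diag(transition))
```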
155. …th the basic R installation. You have to install it explicitly on your system with the package manager.

2. See Appendix A or an introduction to R manual to see what a factor is.

partner, married, divorced. Hence the original numerical coding is lost. If you prefer preserving the numerical coding and losing the labels, use the convert.factors = FALSE option.

Stata (.dta) format. Here is an example of how to import the living arrangement history data from the biographic questionnaire of the Swiss Household Panel (SHP). We use for that a truncated version of the original shp0_bvla_user.dta file that can be found on the SHP CD. This CD can be obtained on request from the SHP (www.swisspanel.ch). The R function to import data sets saved in the Stata .dta format is provided by the foreign library and is named read.dta. It returns a data frame object. The head() function shows the first 6 rows of the imported data set.

R> library(foreign)
R> LA <- read.dta("data/shp0_bvla_user.dta")
R> head(LA)
  idpers q_source bvla_idx bvla013 bvla014 bvla100
1   4101     2002        1    1965    1989 with both natural parents
2   4101     2002        2    1989    1990 with partner married or not
3   4101     2002        3    1990    1991 with partner married or not
4   4101     2002        4    1991    2002 with partner married or not
5   4102     2002        1    1968    1985 with both natural parents
6   4102     2002        2    1985    1988 alone

The summ
156. the default options, that is, the time axis is a calendar time axis defined by taking the minimum and maximum years at which an episode begins and ends.

R> LA.sts <- seqformat(LA, id = "idpers", begin = "bvla013", end = "bvla014", status = "bvla100_rec", from = "SPELL", to = "STS", process = FALSE)

The resulting STS data contains the living arrangements from the birth of the respondents to the year of the survey (2002). Hence the time at which the first spell begins is the birth year of the respondent. Since the oldest respondent in our sample was born in 1914, our time axis is defined from 1914 to 2002. The first case was born in 1965, hence the first valid state appears in the column named y1965.

R> LA.sts[1, ]
  y1914 y1915 y1916 y1917 y1918 y1919 y1920 y1921 y1922 y1923 y1924 y1925 y1926
1    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
  y1927 y1928 y1929 y1930 y1931 y1932 y1933 y1934 y1935 y1936 y1937 y1938 y1939
1    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
  y1940 y1941 y1942 y1943 y1944 y1945 y1946 y1947 y1948 y1949 y1950 y1951 y1952
1    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
  y1953 y1954 y1955 y1956 y1957 y1958 y1959 y1960 y1961 y1962 y1963 y1964 y1965
1    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA     6
  y1966 y1967 y1968 y1969 y1970 y1971 y1972 y1973 y1974 y1975 y1976 y1977 y1978
1     6     6     6     6     6     6     6     6     6     6     6     6     6
  y1979 y1980 y1981 y1982 y1983 y1984 y1985 y1986 y1987 y1988 y1989 y1990 y1991
1     6     6     6     6     6     6     6     6     6     6    10    10    10
  y1992
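The logic of converting spells to one state per calendar year can be sketched in base R as follows. The spell dates are those of respondent 4101 in the manual's head(LA) output and the codes 6 and 10 are those visible in the STS output above; the rule that a later spell overwrites the year it shares with the previous one is an assumption, chosen here because it reproduces the 24 and 14 years shown.

```r
# Spells of one respondent: begin year, end year, recoded status
spells <- data.frame(
  begin  = c(1965, 1989, 1990, 1991),
  end    = c(1989, 1990, 1991, 2002),
  status = c(6, 10, 10, 10)
)

# Calendar time axis from 1914 (oldest birth year) to 2002 (survey year)
years <- 1914:2002
sts <- matrix(NA, nrow = 1, ncol = length(years),
              dimnames = list("1", paste0("y", years)))

# Fill each spell in turn; a later spell overwrites the shared boundary year
for (i in seq_len(nrow(spells))) {
  cols <- paste0("y", spells$begin[i]:spells$end[i])
  sts[1, cols] <- spells$status[i]
}

print(sts[1, c("y1964", "y1965", "y1988", "y1989", "y2002")])
print(table(sts, useNA = "ifany"))
```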
157. the respondents entered the study at different points in time and we represent the data on a calendar time axis, the data could look like this:

R> ex2.cal
   1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001
s1    1    1    1    1    1    1    1    3    3    3   NA   NA
s2   NA    2    2    2    2    2    2    2    3    3    3   NA
s3   NA   NA    1    1    1    2    2    2    2    2    2    2

Table 6.1: Start and end of the sequences in the ex1 data set

Sequence   Start position   End position   Gap positions
s1         4                13
s2         1                10
s3         2                11
s4         1                10             3, 4
s5         1                10             2, 7
s6         1                10             1, 2, 3

The meaning of the states does not matter here. Respondent 1 entered the study in 1990 and was followed up until 1999, while respondent 3 entered in 1992 and was followed up until 2001. In this case, it may be more appropriate to represent the data on a process time axis, where all sequences would be left-aligned, meaning that their common origin is not a specific year but the beginning of the observed 10-year spell.

R> ex2.proc
   T1 T2 T3 T4 T5 T6 T7 T8 T9 T10
s1  1  1  1  1  1  1  1  3  3   3
s2  2  2  2  2  2  2  2  3  3   3
s3  1  1  1  2  2  2  2  2  2   2

Internal gaps. Sequences may also contain gaps, i.e. some unknown statuses inside the sequence due to non-response or other reasons.

6.5.2 Handling the different kinds of missing values

From the above discussion, we may distinguish three types of elements in the matricial representation of sequence data: 1. statuses composin
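Moving from the calendar time axis to the process time axis amounts to left-aligning each row on its first observed state. A minimal base R sketch, using the ex2.cal values shown above and a hypothetical helper `left_align`, could look like this.

```r
# Calendar-time representation, as printed above
ex2.cal <- rbind(
  s1 = c(1, 1, 1, 1, 1, 1, 1, 3, 3, 3, NA, NA),
  s2 = c(NA, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, NA),
  s3 = c(NA, NA, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2)
)
colnames(ex2.cal) <- 1990:2001

# Keep 'len' positions starting at the first non-missing state of a row
left_align <- function(x, len) {
  first <- which(!is.na(x))[1]
  x[first:(first + len - 1)]
}

# apply() returns one column per row of the input, hence the transpose
ex2.proc <- t(apply(ex2.cal, 1, left_align, len = 10))
colnames(ex2.proc) <- paste0("T", 1:10)
print(ex2.proc)
```

The sketch assumes that every respondent is observed for exactly 10 consecutive years, as in the example.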
158. the term sequence refers to a sequence of events rather than of states. Hence a sequence is considered to be an ordered list of transitions, each transition being characterized by the set of distinct events that must occur for the transition to take place (an event cannot appear more than once in the same transition). For instance, in the sequence (Leaving Home, Couple) -> (Marriage) -> (First child), Leaving Home, Couple, Marriage and First child are events, whereas (Leaving Home, Couple) is the transition, defined here by two events, between the state "Has not left home and no partner" and the state "Has left home and has a partner". The distinction between transition and event makes it possible to account for the simultaneity of some events.

In this chapter, we are interested in finding frequent subsequences in our event sequence data set. We also propose tools for identifying, among frequent subsequences, those that discriminate the most between predefined groups, such as between men and women. A subsequence of x is an event sequence that is formed by a subset of the events of sequence x and that respects the order of the events in x. For instance, (Leaving Home, Couple) -> (First child) is a subsequence of (Leaving Home, Couple) -> (Marriage) -> (First child), since the order of transitions and events is respected. A subsequence is said to be frequent if it occurs in more than a given minimum number of sequences. This
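The definition of a subsequence given above can be expressed as a small base R check. This is a simplified sketch, not the search used by TraMineR: an event sequence is represented as a list of character vectors (one vector of simultaneous events per transition), time stamps are ignored, and `is_subsequence` is a hypothetical helper.

```r
# TRUE if every transition of 'sub' is contained, in order, in a transition of 'x'
is_subsequence <- function(sub, x) {
  pos <- 1
  for (events in sub) {
    # advance to the first remaining transition of x containing all these events
    while (pos <= length(x) && !all(events %in% x[[pos]])) pos <- pos + 1
    if (pos > length(x)) return(FALSE)
    pos <- pos + 1   # the next transition of 'sub' must come strictly later
  }
  TRUE
}

x <- list(c("Leaving Home", "Couple"), "Marriage", "First child")

ok_order    <- is_subsequence(list(c("Leaving Home", "Couple"), "First child"), x)
wrong_order <- is_subsequence(list("First child", "Marriage"), x)
print(c(ok_order, wrong_order))
```

The second call fails because Marriage does not occur after First child in x, which is the order requirement stated in the definition.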
159. …thlegend = FALSE option for the seqdplot, seqfplot and seqiplot functions. For example, the following code generates three plots and a legend side by side, as shown in Figure 7.1.

R> par(mfrow = c(2, 2))
R> seqiplot(biofam.seq, title = "Index plot (first 10 sequences)", withlegend = FALSE)
R> seqdplot(biofam.seq, title = "State distribution plot", withlegend = FALSE)
R> seqfplot(biofam.seq, title = "Sequence frequency plot", withlegend = FALSE, pbarw = TRUE)
R> seqlegend(biofam.seq)

[Figure 7.1: Legend plotted as an additional graphic. Panels: Index plot (first 10 sequences), State distribution plot, Sequence frequency plot; time axis from a15 to a29; legend: Parent, Left, Married, Left+Marr, Child, Left+Child, Left+Marr+Child, Divorced.]

7.2 Describing and visualizing sequence data sets

In this section we present fu
160. three parts of the sequences

Sequence   vl        vg     vr
s1         1, 2, 3   ∅      ∅
s2         ∅         ∅      11, 12, 13
s3         1         ∅      12, 13
s4         ∅         3, 4   11, 12, 13
s5         ∅         2, 7   11, 12, 13
s6         1, 2, 3   ∅      11, 12, 13

• The first part of the sequence is made of the missing values appearing before the first (leftmost) valid state element. This part can be void if the sequence begins with a valid state. The associated vector vl contains the indexes of all missing values appearing before the first (leftmost) valid state in a sequence. Hence vl = (1, 2, 3) for s1 and s6, vl = ∅ for s2, s4 and s5, and vl = (1) for s3. Settings for handling missing values in this part of the sequence are defined with the left option.
• The second part of the sequence begins with the first (leftmost) valid state and ends with the last (rightmost) valid state. The associated vector vg contains the indexes of all missing values appearing in this part of the sequence. Hence vg = (3, 4) in s4, vg = (2, 7) in s5 and vg = ∅ in s1, s2, s3 and s6. Settings for handling missing values in this part of the sequence are defined with the gaps argument.
• The third part of the sequence is made of the missing values appearing after the last (rightmost) valid state element. The associated vector vr contains the indexes of all missing values appearing in this part of the sequence. Hence vr = (11, 12, 13) for s2, s4, s5 and s6, vr = (12, 13) for s3 and vr for
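The three index vectors described in the list above can be computed for any row of a state matrix with a few lines of base R. The helper name `missing_parts` and the state values in the example row are made up; only the positions of the missing values follow sequence s4 of the table. The sketch assumes that the row contains at least one valid state.

```r
# Split the positions of missing values into left (vl), gap (vg) and right (vr) parts
missing_parts <- function(x) {
  valid <- which(!is.na(x))
  miss  <- which(is.na(x))
  first <- min(valid)   # position of the first (leftmost) valid state
  last  <- max(valid)   # position of the last (rightmost) valid state
  list(
    vl = miss[miss < first],
    vg = miss[miss > first & miss < last],
    vr = miss[miss > last]
  )
}

# Same pattern of missing values as s4: gaps at 3 and 4, nothing after position 10
s4 <- c(1, 1, NA, NA, 2, 2, 2, 3, 3, 3, NA, NA, NA)
parts <- missing_parts(s4)
print(parts)
```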
161. …tinct states an individual may be in. In the previous example, the alphabet is taken from the data, that is, we suppose that all possible states appear in the imported sequences. Some options to specify the alphabet and other attributes manually will be described later.

In the actcal data set, sequences are in the STS format (see Section 4.2.1), the format used by TraMineR to store data in sequence objects. If your data is already in this format, you can omit the informat option because STS is its default value. You just call the seqdef function and specify the columns containing the sequence data with the var option (if your data contain only sequences and no covariate, you can also omit this option).

As discussed in the previous chapter, state sequences may be presented in some non-STS format, such as SPS for example. Moreover, in some cases sequences are not directly defined as such, but can be derived from data originally collected as spells or time-stamped events. We describe hereafter the formats that TraMineR can read and convert into a sequence object, using some real-life example data sets. The informat option of the seqdef function is used to specify the original format of the input data. Refer to Section 4.2 for identifying the actual format of your data.

6.1.1 Creating a sequence object from SPS-formatted data

In the SPS format (Section 4.2.2), sequences are defined with state-duration couples. The next example shows the
162. to provide an additional data file containing the birth years of the respondents, as described before (see 5.2.2).

R> head(shp.birthyr)
  idpers birthm birthy sex
1   2713      7   1965   1
2   2714     11   1968   2
3   2715      3   1991   1
4   2716      3   1993   1
5   2717      3   1996   2
6   3713      6   1961   1

The sequences are created by using the pdata and pvar options.

R> LA.seq <- seqdef(LA, var = c("idpers", "bvla013", "bvla014", "bvla100"), informat = "SPELL", states = LA.states, labels = LA.labels, process = TRUE, pdata = shp.birthyr, pvar = c("idpers", "birthy"))

Now the sequence for the first individual in the data begins at age 1 (he was born in 1965 and his living arrangement history begins in 1965). He has been in state 4 (with both natural parents) during 24 years and then in state 10 (with a partner) during 13 years.

R> LA.seq[1:2, ]
  Sequence
1 4-4-4-4-4-4-4-4-4-4-4-4-4-4-4-4-4-4-4-4-4-4-4-10-10-10-10-10-10-10-10-10-10-10-10-10-10
2 4-4-4-4-4-4-4-4-4-4-4-4-4-4-4-4-1-1-1-1-10-10-10-10-10-10-10-10-10-10-10-10-10-10

6.2 Attributes of sequence objects

When creating a sequence object with the seqdef function, several attributes are stored together with the sequence data, namely
• the alphabet,
• the color palette used for representing states in plots,
• optional state labels,
• the code used for missing values,
• the starting time of the sequences.

Those attributes are retrieved by other TraMineR functions; for instance, the alphabet, color pale
163. …tte and state labels associated to the object are used by the TraMineR sequence plotting functions. If no values for the attributes are provided by the user, those are set automatically. The default values and the options available to the user to override them are described below.

2. The color palette and state labels can be overridden by options to the plotting functions.

6.2.1 State codes

In a sequence object, the variables (columns) where the states composing the sequences are stored are R factors. An R factor has an internal numeric code and a label. It resembles the numerically coded variables with value labels found in SPSS or Stata. When importing data from statistical software such as SPSS or Stata, all variables with value labels are converted into R factors, unless you specify otherwise. When creating a sequence object, if you do not specify the list of possible states yourself, TraMineR uses the factor levels (i.e. the value labels) to create the alphabet. To illustrate, we go back to our SPELL data set described in Section 6.1.2. If we create a sequence object using the state labels present in the data, it would look like this:

R> print(LA.B.seq[1, ], format = "SPS")
  Sequence
1 (with both natural parents,23)-(with partner married or not,14)

The alphabet would be made of the factor levels.

R> alphabet(LA.B.seq)
1 alone
2 no answer
3 other situation
4 with both nat
164. …uences 3 and 5. Elzinga proposes as first measure of distance between sequences x and y

$d_P(x, y) = |x| + |y| - 2 A_P(x, y)$

where $A_P(x, y)$ is the LLCP (length of the longest common prefix) between sequences x and y.

1. See Elzinga (2008) for a more complete explanation.

9.2.2 Computing LCP distances

The LCP distances can be computed with the seqdist function by specifying the method = "LCP" option. The following example reproduces the results shown in the lower triangle of Table 4 in Elzinga (2008).

R> seqdist(famform.seq, method = "LCP")
     [,1] [,2] [,3] [,4] [,5]
[1,]    0    1    2    3    5
[2,]    1    0    1    2    6
[3,]    2    1    0    1    7
[4,]    3    2    1    0    8
[5,]    5    6    7    8    0

Elzinga also suggests a normalized LCP metric that is insensitive to the length of the sequences, namely

$D_P(x, y) = 1 - S_P(x, y)$, with $S_P(x, y) = \dfrac{A_P(x, y)}{\sqrt{|x| \cdot |y|}}$

This normalized metric is obtained with the option norm = TRUE.

R> seqdist(famform.seq, method = "LCP", norm = TRUE)
          [,1]      [,2]      [,3]      [,4] [,5]
[1,] 0.0000000 0.1835034 0.2928932 0.3675445    1
[2,] 0.1835034 0.0000000 0.1339746 0.2254033    1
[3,] 0.2928932 0.1339746 0.0000000 0.1055728    1
[4,] 0.3675445 0.2254033 0.1055728 0.0000000    1
[5,] 1.0000000 1.0000000 1.0000000 1.0000000    0

Those who prefer similarity measures can easily get them by taking the complement to one of the normalized distance values, $S_P(x, y) = 1 - D_P(x, y)$.

R> 1 - seqdist(famform.seq, method = "LCP", norm = TRUE)
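The two formulas above are easy to verify by hand with base R. In the sketch below the five toy sequences are an assumption: they were chosen to have lengths 2, 3, 4, 5 and 3 and nested prefixes, which is what the distance matrix above implies, but they are not read from the famform data set. The helpers `llcp` and `lcp_dist` are hypothetical names.

```r
# Toy sequences with nested prefixes (assumed, for illustration only)
seqs <- list(
  c("S", "U"),
  c("S", "U", "M"),
  c("S", "U", "M", "MC"),
  c("S", "U", "M", "MC", "SC"),
  c("U", "M", "MC")
)

# Length of the longest common prefix
llcp <- function(x, y) {
  n <- min(length(x), length(y))
  same <- x[seq_len(n)] == y[seq_len(n)]
  if (all(same)) n else which(!same)[1] - 1
}

# LCP distance, raw or normalized by the geometric mean of the lengths
lcp_dist <- function(x, y, norm = FALSE) {
  a <- llcp(x, y)
  if (norm) 1 - a / sqrt(length(x) * length(y))
  else length(x) + length(y) - 2 * a
}

idx <- seq_along(seqs)
d  <- outer(idx, idx, Vectorize(function(i, j) lcp_dist(seqs[[i]], seqs[[j]])))
dn <- outer(idx, idx, Vectorize(function(i, j) lcp_dist(seqs[[i]], seqs[[j]], norm = TRUE)))
print(d)
print(round(dn, 7))
```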
165. …ul facilities for describing sets of sequences. TraMineR is, to our knowledge, the first such toolbox for the free R statistical and graphical environment. Our objective with TraMineR is to put together most of the features proposed separately by other software packages, as well as to offer original tools for extracting useful knowledge from sequence data. Its salient characteristics are:
• R and TraMineR are free.
• Since TraMineR is developed in R, it takes advantage of many already optimized procedures of R as well as of its powerful graphics capabilities.
• R runs under several OS, including Linux, MacOS X, Unix and Windows. The same R program runs unmodified under all operating systems. The same is indeed true for R packages and hence for TraMineR.
• TraMineR features a unique set of procedures for analysing and visualizing sequence data, such as handling a large number of state and time-stamped event sequence representations, simple functions for transforming to and from different formats, individual sequence summaries and summaries of sequence sets, selecting and displaying the most frequent sequences or subsequences, various met

1. R demo scripts named Describing_visualizing, Similarities and Event_sequences are in the demo directory of the package tree and can be run by means of demo(), for instance demo("Describing_visualizing", package = "TraMineR") for the first one.
166. …ural parents
5 with both natural parents and friends or flat share
6 with both natural parents and the partner married married
7 with friends or in a flat share
8 with one parent alone
9 with one parent and his her new partner
10 with partner married or not
11 with partner married or not and friends or flat share
12 with relatives or in a foster family

Hence, if states in the original data set are represented by labels, it may be useful to change the state labels to shorter symbols (in the plots one can still optionally specify a more descriptive legend of the represented states). This can be done when creating the sequence object with the states option. When creating the LA.seq sequence object, we specified the states = 1:12 option to code the states as numbers ranging from 1 to 12. The sequence object is much more readable when it is displayed.

R> print(LA.seq[1, ], format = "SPS")
  Sequence
1 (4,23)-(10,14)
R> alphabet(LA.seq)
[1] 1 2 3 4 5 6 7 8 9 10 11 12

6.2.2 Alphabet

If you create a sequence object without specifying the alphabet option, all possible states are supposed to be present in the data set and the alphabet is set by listing the distinct states encountered. However, in some cases we may have to consider states that are not present in the data set used to create the sequence object. Suppose for instance that you want t
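The relation between factor levels, short state codes and an alphabet that includes unobserved states can be illustrated with plain R factors. This sketch does not create a TraMineR sequence object; the three statuses are taken from the list above and the object names are made up.

```r
# A status variable as it would come from labelled data
status <- factor(c("with both natural parents",
                   "with partner married or not",
                   "alone"))

# The levels (sorted alphabetically by default) play the role of the alphabet
print(levels(status))

# Replace the long labels by short numeric codes, one per level
short <- factor(status, levels = levels(status),
                labels = seq_along(levels(status)))
print(as.character(short))

# Declare an additional state that does not occur in these data
full_alphabet <- c(levels(status), "with one parent alone")
full <- factor(as.character(status), levels = full_alphabet)
print(table(full))   # the unobserved state is listed with a count of 0
```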
167. used to automatically compare the version numbers of installed packages with the newest available version on the repositories and update outdated packages on the fly. Information on new features added to updated versions of the package is given in the NEWS file (see http://cran.r-project.org/web/packages/TraMineR/index.html).

3.2 Data sets included in the TraMineR package

Several sequence data sets used in this manual are included in the TraMineR package and can be loaded in memory using the data function. The actcal and biofam data sets were created from the Swiss Household Panel (SHP) data (http://www.swisspanel.ch).

1. You can use ?seqdef or help(seqdef), or the reference manual, to see what the expected arguments are.
2. Those example data sets are random samples drawn from the original files and are only used for documenting the package. Persons interested in using the data from the Swiss Household Panel for their research must sign a data protection contract to get access to the complete and original files.

3.2.1 The actcal data set

The next example shows how to load the actcal data set, list the names of its columns and display the content of the first row. You may get an overview and summary statistics of the whole actcal data set by issuing the summary(actcal) command (output not shown).

R> data(actcal)
R> names(actcal)
 [1] "idhous00" "age00"    "educat00" "civ
168. R> mvad$Southern <- factor(mvad$Southern, labels = yn)
R> mvad$S.Eastern <- factor(mvad$S.Eastern, labels = yn)
R> mvad$Western <- factor(mvad$Western, labels = yn)
R> mvad$Grammar <- factor(mvad$Grammar, labels = yn)
R> mvad$funemp <- factor(mvad$funemp, labels = yn)
R> mvad$gcse5eq <- factor(mvad$gcse5eq, labels = yn)
R> mvad$fmpr <- factor(mvad$fmpr, labels = yn)
R> mvad$livboth <- factor(mvad$livboth, labels = yn)

Now we summarize the data frame.

R> summary(mvad[, 1:17])
       id            weight        male     catholic   Belfast    N.Eastern
 Min.   :  1.0   Min.   :0.1300   no :342   no :368   no :624   no :503
 1st Qu.:178.8   1st Qu.:0.4500   yes:370   yes:344   yes: 88   yes:209
 Median :356.5   Median :0.6900
 Mean   :356.5   Mean   :0.9994
 3rd Qu.:534.2   3rd Qu.:1.0700
 Max.   :712.0   Max.   :4.4600
 Southern   S.Eastern  Western    Grammar    funemp     gcse5eq    fmpr
 no :497    no :629    no :595    no :583    no :595    no :452    no :537
 yes:215    yes: 83    yes:117    yes:129    yes:117    yes:260    yes:175
 livboth       Jul.93          Aug.93         Sep.93
 no :261   Min.   :1.000   Min.   :1.00   Min.   :1.000
 yes:451   1st Qu.:2.000   1st Qu.:2.00   1st Qu.:1.000
           Median :3.000   Median :3.00   Median :2.000
           Mean   :3.006   Mean   :3.15   Mean   :2.881
           3rd Qu.:5.000   3rd Qu.:4.00   3rd Qu.:3.000
           Max.   :5.000   Max.   :5.00   Max.   :5.000

5.1.3 Data storage in R

A set of sequences (i.e. vectors or strings of states or events) can be stored in several kinds of R objects, namely vectors, matrices or data fra
169. we display the first event sequence from the event sequence objects created respectively with the transition (default), state and period methods.

R> bf.seqe[1]
[1] (Parent)-9.00-(Parent>Left+Marr)-1.00-(Left+Marr>Left+Marr+Child)-6.00
R> bf.seqestate[1]
[1] (Parent)-9.00-(Left+Marr)-1.00-(Left+Marr+Child)-6.00
R> bf.seqeperiod[1]
[1] (Parent)-9.00-(endParent,Left+Marr)-1.00-(endLeft+Marr,Left+Marr+Child)-6.00

Event sequences are represented using the following form:

(e1,e2,...)-elapsedtime-(e2,...)-endtime

where elapsedtime is the time elapsed between two consecutive sets of events, (e1,e2,...) is a transition, that is, a non-empty list of simultaneous events, and endtime is the time elapsed between the last transition and the end of observation. The string representing the first sequence (transition method) means that the trajectory described starts at time 0 with the Parent event, meaning that at the first observed age the concerned person is living with her/his parents, which is followed nine years later by the event Parent>Left+Marr (leaving home and marrying) and one year later by Left+Marr>Left+Marr+Child (first child), which occurs 6 years before the end of the 16 years of observation.

10.2 Searching for frequent event subsequences

The function seqefsub searches for frequent event subsequences. It takes at least two argum
170. …ween May and November 2000 and had no remunerated job in December 2000. Note that row names are arbitrary character strings that can be easily modified; we explain how in the appendix (see paragraph A.3.4, p. 118).

3.2.2 The biofam data set

The biofam data set was constructed by Müller et al. (2007) from the data of the retrospective biographical survey carried out by the Swiss Household Panel in 2002. It includes only individuals

Table 3.1: State definition for the activity calendar (actcal) data set

Code   Status
A      full-time paid job (37 hours or more per week)
B      part-time paid job (19-36 hours per week)
C      part-time paid job (1-18 hours per week)
D      no work, unemployment, other

Table 3.2: Covariates and state variables of the activity calendar (actcal) data set

Variable    Label
age00       age in 2000
educat00    education level in 2000
civsta00    civil status of the respondent in 2000
nbadul00    number of adults in the household
nbkid00     number of children under 15 in the household
aoldkid00   age of the oldest kid in the household
ayoukid00   age of the youngest kid in the household
region00    region the household is living in
com2.00     type of community the household is living in
sex         sex of the respondent
birthy      birth year of the respondent
jan00       activity status for January 2000
dec00       activity status for
171. …xtract the entropy measures with sd$Entropy. By the way, we also illustrate how we can save the graphic in a pdf file, so that it can for instance be inserted into this manual. To do this, we open a pdf file with the pdf() function, create the graphic with the plot command and close the pdf file with the dev.off function. The result is shown in Figure 7.4. Of course, if you want to run this program on your system, you should adapt the path to the pdf file to your convenience. Users who prefer to save their graphics in the postscript format can use postscript() instead of pdf(). There are likewise png() and jpeg() functions.

R> sd <- seqstatd(biofam.seq)
R> pdf(file = "Graphiques/fg_biofam-entropy.pdf", width = 8, height = 8, pointsize = 14)
R> plot(sd$Entropy, main = "Entropy of biofam state distribution by age", xlab = "Age", ylab = "Entropy", type = "h", lwd = 3.5, col = "blue", axes = F, cex = 1.3, frame.plot = T)
R> axis(1, labels = names(biofam.seq), at = 1:16)
R> axis(2, at = c(0, 0.2, 0.4, 0.6, 0.8, 1))
R> dev.off()

[Figure 7.4: Entropy of state distribution by age, biofam data set. x axis: Age (a15 to a29); y axis: Entropy.]

If you prefer a line instead of vertical bars, you just have to change the plot type from "h" to "l", as when plotting the state distribution entropy for the mvad data set (Fig. 7.5).

R> mvad.sd <- seqs
172. …y state variables from July 1993 to June 1999 for 712 individuals. All the following commands show the process of analysing a sequence data set and can be issued by a user who has R and TraMineR installed on their computer.

2.1 State sequence analysis

1. Loading the TraMineR library and the mvad example data set.

R> library(TraMineR)
R> data(mvad)

2. Defining a vector containing the legends for the states to appear in the graphics and creating a sequence object, which will be used as argument to the next functions (see Chapter 6).

R> mvad.labels <- c("employment", "further education", "higher education", "joblessness", "school", "training")
R> mvad.scode <- c("EM", "FE", "HE", "JL", "SC", "TR")
R> mvad.seq <- seqdef(mvad, 17:86, states = mvad.scode, labels = mvad.labels)

3. Drawing in a single figure (Fig. 2.1):
• the index plot of the 10 first sequences (see Section 7.3)
R> seqiplot(mvad.seq, withlegend = F, title = "Index plot (10 first sequences)")
• the sequence frequency plot of the 10 most frequent sequences, with bar width proportional to the frequencies (see Section 7.2)
R> seqfplot(mvad.seq, pbarw = T, withlegend = F, title = "Sequence frequency plot")
• the state distribution by time points (see Section 7.2.2)
R> seqdplot(mvad.seq, withlegend = F, title = "State distribution plot")
• the legend as a separate graphic, since several plots use the same color codes for the states 1
173. y1993 y1994 y1995 y1996 y1997 y1998 y1999 y2000 y2001 y2002
1 10 10 10 10 10 10 10 10 10 10 10

It is also possible to convert directly into the more concise state-permanence format by setting the to option to "SPS". Using the compressed = TRUE option produces compressed sequences (character strings). We can see that the first sequence begins with 51 (1965 - 1914) missing states, coded as *.

R> LA.sps <- seqformat(LA, var = c("idpers", "bvla013", "bvla014", "bvla100_rec"), from = "SPELL", to = "SPS", compressed = TRUE, process = FALSE)
R> head(LA.sps)
  Sequence
1 (*,51)-(6,24)-(10,14)
2 (*,54)-(6,17)-(12,4)-(10,14)
3 (*,47)-(6,17)-(8,5)-(9,1)-(8,1)-(9,14)-(12,3)
4 (*,59)-(6,20)-(10,10)
5 (*,89)
6 (*,11)-(14,28)-(4,50)

Now we convert our data using a process time axis. We therefore need some additional information, namely the respondents' birth years, in order to compute the ages at which the spells begin and end. This information is provided as a separate data set containing only one row for each individual. The data contains the respondents' id as well, so as to match the information on birth year with the spell data. In addition to the birth year, it contains the birth month and the sex of each respondent. Here is an extract of this data set.

R> head(shp.birthyr)
  idpers birthm birthy sex
1   4101      7   1965   1
2   4102     11   1968   2
3   4103      3   1991   1
4   4104      3   1993   1
5   4105      3   1996   2
6
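To see how a compressed state-permanence string relates to the one-state-per-year representation, the following base R sketch expands such a string back into a vector of states. It assumes the textual form (state,duration)-(state,duration)-... with * for missing states; `sps_to_sts` is a hypothetical helper, not a TraMineR function.

```r
# Expand a compressed SPS string into one state per time unit
sps_to_sts <- function(sps) {
  # extract every "(state,duration)" couple
  pairs  <- regmatches(sps, gregexpr("\\(([^,]+),([0-9]+)\\)", sps))[[1]]
  states <- sub("^\\((.*),[0-9]+\\)$", "\\1", pairs)
  durs   <- as.integer(sub("^\\(.*,([0-9]+)\\)$", "\\1", pairs))
  rep(states, times = durs)
}

sts <- sps_to_sts("(*,51)-(6,24)-(10,14)")
print(length(sts))        # 51 + 24 + 14 positions, one per year from 1914 to 2002
print(table(sts))
```

The first valid state sits at position 52, which corresponds to the year 1965 on a time axis starting in 1914.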
