
PQStat User Guide

1. … $H = \sum_{g=1}^{G} \frac{(O_g - E_g)^2}{E_g\left(1 - \frac{E_g}{N_g}\right)}$, where $N_g$ is the number of observations in group $g$. For large sample sizes the statistic asymptotically has the $\chi^2$ distribution with $G - 2$ degrees of freedom. The p-value determined from the test statistic is compared with the significance level $\alpha$: if $p \le \alpha$, we reject $H_0$ and accept $H_1$; if $p > \alpha$, there is no reason to reject $H_0$.
AUC, the area under the ROC curve. The ROC curve, built from the value of the dependent variable and the predicted probability of the dependent variable $P$, makes it possible to evaluate how well the constructed logistic regression model classifies cases into two groups (1) and (0). The curve, and especially the area under it, describes the classification quality of the model. When the ROC curve overlaps the diagonal $y = x$, a decision to assign a case to class (1) or (0) on the basis of the model is no better than a random division of the studied cases into the groups. The classification quality of a model is good when the curve lies well above the diagonal $y = x$, that is, when the area under the ROC curve is much larger than the area under the line $y = x$, i.e. greater than 0.5.
Hypotheses: $H_0: AUC = 0.5$, $H_1: AUC \ne 0.5$.
The test statistic has the form $Z = \frac{AUC - 0.5}{SE_{0.5}}$, where $SE_{0.5}$ is the area error. For large sample sizes the statistic $Z$ asymptotically has the normal distribution. On the basis of test s…
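The statistic at the start of this excerpt can be sketched in a few lines of Python. This is a minimal illustration, assuming the formula reads $H = \sum_g (O_g - E_g)^2 / (E_g(1 - E_g/N_g))$; the function name and the group counts below are hypothetical and do not come from the guide.

```python
def hosmer_lemeshow_statistic(observed, expected, sizes):
    """Sum over groups of (O_g - E_g)^2 / (E_g * (1 - E_g / N_g))."""
    if not (len(observed) == len(expected) == len(sizes)):
        raise ValueError("all three sequences must have one entry per group")
    total = 0.0
    for o_g, e_g, n_g in zip(observed, expected, sizes):
        total += (o_g - e_g) ** 2 / (e_g * (1 - e_g / n_g))
    return total


# Hypothetical data: four groups of 20 observations each.
observed = [2, 5, 9, 14]           # observed number of events per group
expected = [2.5, 5.5, 8.0, 14.0]   # sum of predicted probabilities per group
sizes = [20, 20, 20, 20]

h = hosmer_lemeshow_statistic(observed, expected, sizes)
degrees_of_freedom = len(sizes) - 2  # G - 2, as stated in the text
print(h, degrees_of_freedom)
```

The p-value would then be read from the $\chi^2$ distribution with $G - 2$ degrees of freedom, which is left out here to keep the sketch free of external libraries.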
2. … In the descriptive statistics window, select all the procedures you want to run (for example mean, standard deviation, minimum, maximum) and the variable to analyse (the column containing height), then confirm your choice by clicking OK. If you reduce the datasheet workspace by selecting a contiguous block of data, the following message will appear in the analysis window.
3. You can use a saved selection. If selected ranges are assigned to the sheet, they are highlighted by a frame. They can be used in analyses where the data can be entered directly in the analysis window: clicking the "fill with saved selection" button pastes the data from the selected range.
Copyright 2010-2014 PQStat Software. All rights reserved.
EXAMPLE 4.5 (layers.pqs file). We want to calculate the statistics associated with the Odds Ratio (OR) for several strata. We will use data saved in 10 tables; they are selected (framed). From the Statistics menu we select Stratified analysis → Mantel-Haenszel OR/RR. In the test options window we select contingency table and then set the number of strata to 10. Each created stratum can be filled from a selected range. Once all the tables are filled, we run the analysis by clicking the OK button.
Note: To assign more selections to the datasheet, choose Save selection (Ctrl+T) from the Edition menu. To delete ascribed selections we chose…
3. … bronchitis, others, pneumonia. The agreement with a chance adjustment, $\hat{\kappa} = 44.58\%$, is smaller than the agreement that is not adjusted for chance. The p-value is < 0.000001. This result demonstrates agreement between the opinions of the 2 doctors at the significance level $\alpha = 0.05$.
16 DIAGNOSTIC TESTS
16.1 EVALUATION OF DIAGNOSTIC TEST
Suppose that we use a diagnostic test to detect the occurrence of a particular feature (most often a disease) and that we know the gold standard, so we know whether the feature really occurs among the examined people. On the basis of this information we can build a 2 x 2 contingency table:

Observed frequencies            | Reality (gold standard)
Diagnostic test                 | disease (+) | disease free (-) | Total
positive result (+)             | TP          | FP               | TP+FP
negative result (-)             | FN          | TN               | FN+TN
Total                           | TP+FN       | FP+TN            | n = TP+FP+FN+TN

where TP = true positive, FP = false positive, FN = false negative, TN = true negative.
For such a table we can calculate the following measures:
- Sensitivity and specificity of the diagnostic test. Every diagnostic test can in some cases give results that differ from the actual state. For example, a diagnostic test, on the basis of the obtained parameters, classifies a patient either into the group of people suffering from a particular disease or into the group of healthy people. In reality, the number of people approved for t…
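The two measures named above follow directly from the cells of the 2 x 2 table. The sketch below uses the standard definitions (sensitivity = TP / (TP + FN), specificity = TN / (TN + FP)); the counts are hypothetical and serve only as an illustration.

```python
def sensitivity(tp, fn):
    """Share of truly diseased people that the test marks as positive."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """Share of disease-free people that the test marks as negative."""
    return tn / (tn + fp)


# Hypothetical counts for a 2 x 2 table (not taken from the guide).
tp, fp, fn, tn = 40, 10, 5, 45
n = tp + fp + fn + tn

print("n =", n)
print("sensitivity =", sensitivity(tp, fn))
print("specificity =", specificity(tn, fp))
```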
4. … the $\tilde{\tau}$ is an unbiased estimator of the population parameter $\tau$, while $r_s$ is a biased estimator of the population parameter $R_s$.
14.2.2 The test of significance for the Spearman's rank-order correlation coefficient
The test of significance for the Spearman's rank-order correlation coefficient is used to verify the hypothesis of no monotonic correlation between the analysed features of the population. It is based on the Spearman's rank-order correlation coefficient calculated for the sample. The closer the value of $r_s$ is to 0, the weaker the dependence between the analysed features.
Basic assumptions: measurement on an ordinal scale or on an interval scale.
Hypotheses: $H_0: R_s = 0$, $H_1: R_s \ne 0$.
The test statistic is defined by $t = \frac{r_s}{SE}$, where $SE = \sqrt{\frac{1 - r_s^2}{n - 2}}$. The value of the test statistic cannot be calculated when $r_s = 1$ or $r_s = -1$, or when $n < 3$. The test statistic has the Student's t distribution with $n - 2$ degrees of freedom. The p-value determined from the test statistic is compared with the significance level $\alpha$: if $p \le \alpha$, reject $H_0$ and accept $H_1$; if $p > \alpha$, there is no reason to reject $H_0$.
The settings window with the Spearman's monotonic correlation can be opened in the Statistics menu → NonParametric tests (ordered categories) → monotonic correlation (r-Spearman), or in the Wizard.
5. [Screenshot: "Monotonic correlation (r-Spearman)" settings window with Variable 1 (age) and Variable 2 (height), the Data Filter panel, and the Report options "Add analysed data" and "Add graph".]
EXAMPLE 14.1 (continuation) (age-height.pqs file)
Hypotheses:
$H_0$: there is no monotonic dependence between age and height in the population of children attending the analysed school,
$H_1$: there is a monotonic dependence between age and height in the population of children attending the analysed school.
Results report:
r: 0.839739
Std. err. of r: 0.14512
-95% CI for r coefficient: 0.578742
+95% CI for r coefficient: 0.944696
t statistic for r: 5.786513
Degrees of freedom: 14
p-value: 0.000047
[Figure: scatter plot of height against age.]
Comparing the p-value = 0.000047 with the significance level $\alpha = 0.05$, we conclude that there is a monotonic dependence between age and height in the population of children attending the analysed school. The dependence is increasing: children grow taller as they get older. The Spearman's rank…
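The figures in this report can be checked by hand. The sketch below is one way to do it, assuming the 16 age/height pairs listed for Example 14.1 elsewhere in the guide, average ranks for ties, and the test statistic $t = r_s / SE$ with $SE = \sqrt{(1 - r_s^2)/(n - 2)}$. The helper names are illustrative and are not part of PQStat.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions i..j hold ranks i+1..j+1
        for pos in range(i, j + 1):
            ranks[order[pos]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


age = [5, 5, 5, 6, 6, 6, 7, 7, 8, 8, 8, 9, 9, 10, 10, 10]
height = [128, 129, 135, 132, 137, 140, 148, 150,
          135, 142, 151, 138, 153, 159, 160, 162]

n = len(age)
r_s = pearson(average_ranks(age), average_ranks(height))  # Spearman's r
se = math.sqrt((1 - r_s ** 2) / (n - 2))
t = r_s / se

print(round(r_s, 6), round(se, 5), round(t, 6), n - 2)
```

Computing Spearman's coefficient as Pearson's coefficient on the ranks is a common approach when ties are present; whether the program uses exactly this route is an assumption, though it agrees with the reported r, standard error and t statistic.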
6. Note: Function $Z$ can contain variable interactions. In such a case we introduce into the model a variable that is the product of the variables taking part in the interaction, e.g. $X_1 \times X_2$.
The logit is the transformation of that model into the form $\mathrm{logit}(P) = \ln\left(\frac{P}{1 - P}\right) = Z$.
The matrices involved in the equation, for a sample of size $n$, are recorded in the following manner:
$$Y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}, \quad X = \begin{pmatrix} 1 & x_{11} & x_{21} & \dots & x_{k1} \\ 1 & x_{12} & x_{22} & \dots & x_{k2} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_{1n} & x_{2n} & \dots & x_{kn} \end{pmatrix}, \quad \beta = \begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_k \end{pmatrix}$$
In such a case the solution of the equation is the vector of the estimates of the parameters $\beta_0, \beta_1, \dots, \beta_k$, called regression coefficients: $b_0, b_1, \dots, b_k$. The coefficients are estimated with the maximum likelihood method, that is, by searching for the maximum value of the likelihood function $L$ (the program uses the Newton-Raphson iterative algorithm). On the basis of those values we can infer the magnitude of the effect of the independent variable, for which the coefficient was estimated, on the dependent variable.
There is a certain error of estimation for each coefficient. The magnitude of that error is estimated from the formula $SE_b = \sqrt{\mathrm{diag}(H^{-1})_b}$, where $\mathrm{diag}(H^{-1})$ is the main diagonal of the covariance matrix.
Note: When building the model you need to remember that the number of observations should be at least ten times the number of estimated parameters of the model: $n \ge 10(k + 1)$.
Note: When building the model you need to remember that…
7. … eigenvalues; $a_i = (a_{i1}, a_{i2}, \dots, a_{ip})$: the eigenvector corresponding to the $i$-th eigenvalue; $M$: the variance matrix or covariance matrix of the original variables $X_1, X_2, \dots, X_p$; $I$: the identity matrix (1 on the main diagonal, 0 outside of it).
18.1.1 The interpretation of coefficients related to the analysis
Every principal component is described by:
Eigenvalue. An eigenvalue tells us which part of the total variability is explained by a given principal component. The first principal component explains the greatest part of the variance, the second principal component explains the greatest part of the variance not explained by the first, and each subsequent component explains the greatest part of the variance not explained by the previous components. As a result, each subsequent principal component explains a smaller and smaller part of the variance, which means that the subsequent eigenvalues are smaller and smaller. The total variance is the sum of the eigenvalues, which allows us to calculate the percentage of variability explained by each component:
$$\frac{\lambda_i}{\lambda_1 + \lambda_2 + \dots + \lambda_p} \cdot 100\%$$
Consequently, one can also calculate the cumulative variability and the cumulative variability percentage for the subsequent components.
Eigenvector. An eigenvector reflects the influence of particular original variables on a given principal component. It contains the $a_{i1}, a_{i2}, \dots, a_{ip}$ coefficients of a lin…
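The variability percentage and its cumulative version can be illustrated with a short sketch. The eigenvalues below are hypothetical; the only formula used is the one described above, $\lambda_i / (\lambda_1 + \dots + \lambda_p) \cdot 100\%$.

```python
def variability_percentages(eigenvalues):
    """Percentage of the total variance explained by each component."""
    total = sum(eigenvalues)
    return [100 * value / total for value in eigenvalues]


def cumulative(values):
    """Running totals of a sequence."""
    running, result = 0.0, []
    for value in values:
        running += value
        result.append(running)
    return result


# Hypothetical eigenvalues, sorted from largest to smallest.
eigenvalues = [2.5, 1.0, 0.5]

percentages = variability_percentages(eigenvalues)
cumulative_percentages = cumulative(percentages)

print(percentages)             # share explained by each component
print(cumulative_percentages)  # share explained by the first 1, 2, 3 components
```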
8. [86] Wilson E.B. (1927), Probable Inference, the Law of Succession, and Statistical Inference. Journal of the American Statistical Association, 22(158), 209-212.
[87] Yates F. (1934), Contingency tables involving small numbers and the chi-square test. Journal of the Royal Statistical Society, 1(2), 217-235.
[88] Yule G. (1900), On the association of the attributes in statistics: With illustrations from the material of the childhood society, and c. Philosophical Transactions of the Royal Society, Series A, 194, 257-319.
[89] Zweig M.H., Campbell G. (1993), Receiver-operating characteristic (ROC) plots: a fundamental evaluation tool in clinical medicine. Clinical Chemistry, 39, 561-577.
9. EXAMPLE 11.7 (cont.) (sex-exam.pqs file)
You know that 55.56% of all the women in the sample passed the exam and 25.00% of all the men in the sample passed the exam. This data can be written in two ways: as a numerator and a denominator for each sample, or as a proportion and a denominator for each sample:

numerator (women) | denominator (women) | numerator (men) | denominator (men)
50                | 90                  | 20              | 80

proportion (women) | denominator (women) | proportion (men) | denominator (men)
0.555555555556     | 90                  | 0.25             | 80

Hypotheses:
$H_0$: the proportion of the men who passed the exam is the same as the proportion of the women who passed the exam in the analysed population,
$H_1$: the proportion of the men who passed the exam is different from the proportion of the women who passed the exam in the analysed population.
Results report:
Analysed variables: Var2, Var3, Var4, Var5
Significance level: 0.05
Continuity correction: No
Difference of the proportions: 0.305556
-95% CI for the difference of the proportions: 0.158695
+95% CI for the difference of the proportions: 0.433518
NNT: 3.272727
-95% CI NNT: 2.306711
+95% CI NNT: 6.301412
Z statistic: 4.04047
p-value (asymptotic): 0.000053
[Figure: difference of the proportions with its 95% CI.]
Note: It is necessary to select the appropriate area d…
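The difference of proportions, the Z statistic and the NNT in this report can be reproduced with a short calculation. The sketch below takes the counts as 50 of 90 women and 20 of 80 men (the values implied by the quoted proportions 55.56% and 25.00%) and assumes a pooled standard error without continuity correction. That choice is an assumption that happens to match the reported Z; the guide's own formula is not visible in this excerpt.

```python
import math


def two_proportion_z(x1, n1, x2, n2):
    """Return (difference, Z, two-sided p) for two independent proportions."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)  # common proportion under H0
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail area
    return p1 - p2, z, p_value


difference, z, p_value = two_proportion_z(50, 90, 20, 80)
nnt = 1 / difference  # number needed to treat

print(round(difference, 6), round(z, 5), round(nnt, 6))
print(p_value)
```

Note that the NNT confidence limits in the report are the reciprocals of the limits for the difference (1/0.433518 and 1/0.158695), which is why the lower NNT limit corresponds to the upper limit of the difference.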
10. [Figure: flowchart for choosing a test when comparing more than 2 groups. Interval scale: are the data normally distributed (Kolmogorov-Smirnov or Lilliefors test) and dependent; ANOVA for dependent groups, or, after checking whether the variances are equal (Brown-Forsythe or Levene test), ANOVA for independent groups. Ordinal scale: Friedman ANOVA for dependent data, Kruskal-Wallis ANOVA for independent data. Nominal scale: Q-Cochran ANOVA for dependent data, $\chi^2$ test (multidimensional) for independent data.]
Note: A simultaneous comparison of more than two groups can NOT be replaced with repeated use of tests for the comparison of two groups. This follows from the need to control the type I error $\alpha$. If we choose $\alpha$ and apply the selected two-group test $k$ times, the actual error level can become much higher than the assumed $\alpha$. This error can be avoided by using the ANOVA test (Analysis of Variance) together with contrasts or the POST-HOC tests dedicated to it.
12 COMPARISON - MORE THAN 2 GROUPS
12.1 PARAMETRIC TESTS
12.1.1 The ANOVA for independent groups
The one-way analysis of variance (ANOVA for independent groups), proposed by Ronald Fisher, is used to verify the hypothesis of equality of the means of an analysed variable in several ($k \ge 2$) populations.
Basic assumptions: measurement on an interval scale; normality of distribution of the analysed feature in each population; an i…
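The note about repeated two-group tests can be made concrete with a small calculation. The sketch below assumes that the $k$ tests are independent, in which case the chance of at least one false rejection is $1 - (1 - \alpha)^k$. Real pairwise comparisons on the same data are not independent, so this is an illustration of the effect rather than an exact figure.

```python
def pairwise_comparisons(groups):
    """Number of two-group comparisons among `groups` groups."""
    return groups * (groups - 1) // 2


def familywise_error(alpha, tests):
    """Chance of at least one type I error in `tests` independent tests."""
    return 1 - (1 - alpha) ** tests


alpha = 0.05
for groups in (3, 4, 5):
    tests = pairwise_comparisons(groups)
    print(groups, tests, round(familywise_error(alpha, tests), 4))
```

For 3 groups there are 3 pairwise comparisons and the familywise level is already about 0.14 rather than 0.05, which is the inflation the note warns about.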
11. … $X_k$: independent variables (explanatory); $\beta_0, \beta_1, \beta_2, \dots, \beta_k$: parameters; $\epsilon$: random parameter (model residual).
If the model was created on the basis of a data sample of size $n$, the above equation can be presented in matrix form: $Y = X\beta + \epsilon$, where
$$Y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}, \quad X = \begin{pmatrix} 1 & x_{11} & x_{21} & \dots & x_{k1} \\ 1 & x_{12} & x_{22} & \dots & x_{k2} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_{1n} & x_{2n} & \dots & x_{kn} \end{pmatrix}, \quad \beta = \begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_k \end{pmatrix}, \quad \epsilon = \begin{pmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_n \end{pmatrix}$$
In such a case the solution of the equation is the vector of the estimates of the parameters, $b_0, b_1, \dots, b_k$, called regression coefficients.
Those coefficients are estimated with the classical least squares method. On the basis of those values we can infer the magnitude of the effect of the independent variable, for which the coefficient was estimated, on the dependent variable. They tell us by how many units the dependent variable changes when the independent variable changes by 1 unit. There is a certain error of estimation for each coefficient. The magnitude of that error is estimated from the formula
$$SE_b = \sqrt{\frac{1}{n - (k + 1)}\, e^T e\, (X^T X)^{-1}}$$
(taken on the main diagonal), where $e = Y - \hat{Y}$ is the vector of model residuals: the difference between the actual values of the dependent variable $Y$ and the values $\hat{Y}$ predicted by the model.
Note: When constructing the model one should remember that the number of observations has to be greater than or equal…
12. … house.
- We enter functions by double-clicking the name of the selected function. The name then appears in the edition field of the formula. Alternatively, we can type the name directly in the edition field; in that case the capitalization of the letters in the function name does not matter. The function arguments are given in brackets, using the syntax given in the description of the function.
Formula results
The results of the formulas are displayed in the selected column.
If the arguments of a function include values that the function cannot interpret, the program displays a message asking whether the uninterpreted data should be omitted. A confirmation causes the formula to be recalculated without the uninterpreted data. If a negative answer is given, the error value NA is returned. For example, for the values 1, 2 and "ada" in columns v1, v2 and v3 respectively, the sum function applied to v1, v2 and v3 returns 3 if we skip the uninterpreted value "ada", or NA if we do not skip that value in the calculations.
An empty value (missing data) is returned only when all the arguments used in the formula are empty.
The number of rows taking part in the formula can be limited by selecting an appropriate range of rows in the datasheet and by selecting the optio…
13. [Life table excerpt. Identifiable columns: interval (years), number entering the interval, proportion dying, proportion surviving, cumulative proportion surviving; the remaining columns are given as printed.]
… | 0.0987654 | 0.0404362 | 0.0157274 | 0
9-12  | 39 | 0.3943661 | 0.6056338 | 0.6143330 | 0.0807573 | 0.1637426 | 0.0549303 | 0.0182830 | 0
12-15 | 18 | 0.2424242 | 0.7575757 | 0.3720608 | 0.0300655 | 0.0919540 | 0.0603812 | 0.0139645 | 0
15-18 | 11 | 0.4210526 | 0.5789473 | 0.2818642 | 0.0395598 | 0.1777777 | 0.0602764 | 0.0172650 | 0.08…
18-21 | 4  | 0         | 1         | 0.1631845 | 0 | 0 | 0.0570648 | 0
21-24 | 2  | 0         | 1         | 0.1631845 | … | 0.0570648
For each 3-year period we can interpret the results in the table. For example, for people living at least 9 years after the transplantation, who are included in the range 9-12:
- the number of people who survived 9 years after the transplantation is 39;
- there are 7 people about whom we know that they had lived between 9 and 12 years when the information about them was gathered, but we do not know whether they lived longer, because they left the study after that time;
- the number of people at risk of death in that range is 36;
- there are 14 people about whom we know that they died 9 to 12 years after the transplantation;
- 39.4% of the patients at risk died 9 to 12 years after the transplantation;
- 60.6% of the patients at risk survived 9 to 12 years after the transplantation;
- the percent of survivors 9 years after…
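The row for the 9-12 range can be retraced step by step. The sketch below assumes the usual actuarial convention that people withdrawn during an interval count as half at risk, so that 39 - 7/2 = 35.5 (shown as 36 in the text). The variable names are illustrative, and the convention is an assumption that agrees with the proportions quoted in the table.

```python
entered = 39       # alive at the start of the 9-12 range
withdrawn = 7      # left the study during the range (censored)
died = 14          # known deaths during the range
cumulative_at_start = 0.6143330  # cumulative survival up to year 9 (from the table)

at_risk = entered - withdrawn / 2           # 35.5, rounded to 36 in the text
proportion_dying = died / at_risk
proportion_surviving = 1 - proportion_dying

# Values carried over to the next range (12-15).
entered_next = entered - withdrawn - died
cumulative_next = cumulative_at_start * proportion_surviving

print(at_risk, round(proportion_dying, 7), round(proportion_surviving, 7))
print(entered_next, round(cumulative_next, 7))
```

The two values carried forward, 18 people entering the 12-15 range and a cumulative survival of about 0.372, match the next row of the table.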
14. …
The Mantel-Haenszel $\chi^2$ test for the $RR_{MH}$
The Mantel-Haenszel Chi-square test for the $RR_{MH}$ is used to verify the hypothesis about the significance of the designated relative risk ($RR_{MH}$). It should be calculated for large frequencies in a contingency table.
Hypotheses: $H_0: RR_{MH} = 1$, $H_1: RR_{MH} \ne 1$.
The test statistic $\chi^2_{MH}$ is defined using $E_{11}^{(s)}$, the expected frequencies in the first contingency table cell for the individual strata $s = 1, 2, \dots, w$ (the formula itself is not legible in this excerpt). For large frequencies this statistic asymptotically has the $\chi^2$ distribution with 1 degree of freedom. The p-value determined from the test statistic is compared with the significance level $\alpha$: if $p \le \alpha$, reject $H_0$ and accept $H_1$; if $p > \alpha$, there is no reason to reject $H_0$.
The $\chi^2$ test of homogeneity for the RR
The Chi-square test of homogeneity for the RR is used to verify the hypothesis that the variable creating the strata is an effect modifier, i.e. that it influences the designated relative risk in such a way that the relative risks differ significantly between the individual strata.
Hypotheses: $H_0: RR_{MH} = RR_s$ for all the strata $s = 1, 2, \dots, w$; $H_1: RR_{MH} \ne RR_s$ for at least one stratum.
The test statistic, using weighted least sq…
15. The settings window with the Mann-Whitney U test can be opened in the Statistics menu → NonParametric tests (ordered categories) → Mann-Whitney, or in the Wizard.
[Screenshot: "Mann-Whitney" settings window with the variable "time (hours)", the test options "Use the grouping variable" and "Continuity correction", the Data Filter panel, and the Report options "Add analysed data" and "Add graph".]
EXAMPLE 11.2 (computer.pqs file)
It was hypothesised that at a certain university male maths students spend more time in front of a computer screen than female maths students. To verify the hypothesis, a sample of 54 people (25 women and 29 men) was drawn from the population of people who study maths at this university. These people were asked how many hours a day they spend in front of a computer screen. The following results were obtained (time, sex; k = woman, m = man):
2 k; 2 m; 2 m; 3 k; 3 k; 3 k; 3 k; 3 m; 3 m; 4 k; 4 k; 4 k; 4 k; 4 m; 4 m; 5 k; 5 k; 5 k; 5 k; 5 k; 5 k; 5 k; 5 k; 5 k; 5 m; 5 m; 5 m; 5 m; 6 k; 6 k; 6 k; 6 k; 6 k; 6 m; 6 m; 6 m; 6 m; 6 m; 6 m; 6 m; 6 m; 7 k; 7 m; 7 m; 7 m; 7 m; 7 m; 7 m; 7 m; 7 m; 7 m; 8 k; 8 m; 8 m
Hypotheses: $H_0$: the m…
16. Very often the aim of a study is to compare the size of the area under one ROC curve ($AUC_1$) with the area under another ROC curve ($AUC_2$). The ROC curve with the greater area usually allows a more precise classification of objects.
Methods for comparing the areas depend on the model of the study.
- Dependent model: the compared ROC curves are constructed on the basis of measurements made on the same objects.
Hypotheses: $H_0: AUC_1 = AUC_2$, $H_1: AUC_1 \ne AUC_2$.
The test statistic has the form
$$Z = \frac{AUC_1 - AUC_2}{SE_{AUC_1 - AUC_2}}$$
where $AUC_1$, $AUC_2$ and the standard error of the difference in areas, $SE_{AUC_1 - AUC_2}$, are calculated with the nonparametric method proposed by DeLong (DeLong E.R. et al., 1988 [26]; Hanley J.A. and Hajian-Tilaki K.O., 1997 [38]). For large sample sizes the statistic $Z$ asymptotically has the normal distribution. The p-value determined from the test statistic is compared with the significance level $\alpha$: if $p \le \alpha$, reject $H_0$ and accept $H_1$; if $p > \alpha$, there is no reason to reject $H_0$.
The window with settings for comparing dependent ROC curves is accessed via the menu Statistics → Diagnostic tests → Dependent ROC Curves - comparison.
[Screenshot: "Dependent ROC curves - comparison" settings window.]
17. [Figure: Bland-Altman plots comparing sound level meters, including sound level meter I against III and II against III; horizontal axis: mean of the two compared meters, about 80 to 87.]
15.2 NONPARAMETRIC TESTS
15.2.1 The Kendall's coefficient of concordance and the test of its significance
The Kendall's $\widetilde{W}$ coefficient of concordance is described in the works of Kendall and Babington-Smith (1939) [43] and Wallis (1939) [80]. It is used when the results come from different sources (from different judges) and concern a few ($k \ge 2$) objects, and the concordance of the assessments needs to be established. It is often used to measure interjudge reliability, that is, the degree to which the judges' assessments agree.
The Kendall's coefficient of concordance is calculated on an ordinal scale or an interval scale. Its value is calculated according to the following formula:
$$\widetilde{W} = \frac{12U - 3n^2 k (k + 1)^2}{n^2 k (k^2 - 1) - nC}$$
where:
$n$: number of different sets of assessments (the number of judges),
$k$: number of ranked objects,
$U = \sum_{j=1}^{k} \left( \sum_{i=1}^{n} R_{ij} \right)^2$,
$R_{ij}$: ranks ascribed to the consecutive objects ($j = 1, 2, \dots, k$), independently for each judge ($i = 1, 2, \dots, n$),
$C = \sum (t^3 - t)$: a correction for ties,
$t$: number of cases incorporated into a tie.
The coefficient's formula includes C…
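The coefficient can be computed directly from a table of ranks. The sketch below assumes the formula reads $\widetilde{W} = (12U - 3n^2k(k+1)^2) / (n^2k(k^2-1) - nC)$, with $C$ summed over every group of tied ranks given by every judge; the function name and the rank tables are hypothetical.

```python
from collections import Counter


def kendalls_w(ranks):
    """ranks[i][j] is the rank that judge i gives to object j."""
    n = len(ranks)      # number of judges
    k = len(ranks[0])   # number of ranked objects
    # U: sum over objects of the squared total of the ranks they received
    u = sum(sum(row[j] for row in ranks) ** 2 for j in range(k))
    # C: tie correction; untied ranks contribute t**3 - t = 0
    c = sum(t ** 3 - t for row in ranks for t in Counter(row).values())
    numerator = 12 * u - 3 * n ** 2 * k * (k + 1) ** 2
    denominator = n ** 2 * k * (k ** 2 - 1) - n * c
    return numerator / denominator


# Three judges who rank three objects identically: full concordance.
full_agreement = [[1, 2, 3], [1, 2, 3], [1, 2, 3]]
# Two judges whose rankings are exactly opposite: no concordance.
opposite = [[1, 2, 3], [3, 2, 1]]

print(kendalls_w(full_agreement))
print(kendalls_w(opposite))
```

The two extreme cases are a quick sanity check: identical rankings give a coefficient of 1, and exactly opposite rankings from two judges give 0.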
18. … there is a statistically significant difference among populations (means, medians, proportions, distributions, etc.).
The researcher must formulate the hypotheses so that they are compatible with reality and with the requirements of the statistical test, for example:
$H_0$: the percentage of women and men running their own businesses in the analysed population is exactly the same.
If you do not know which percentage (men or women) might be greater in the analysed population, the alternative hypothesis should be two-sided. This means you should not assume a direction:
$H_1$: the percentage of women and men running their own businesses in the analysed population is different.
It may happen, though very rarely, that you are sure of the direction in the alternative hypothesis. In this case you can use a one-sided alternative hypothesis.
The 2nd step: Verify which of the hypotheses, $H_0$ or $H_1$, is more probable. Depending on the kind of analysis and the type of variables, you should choose an appropriate statistical test.
Note 1: Choosing a statistical test mainly means choosing an appropriate measurement scale (interval, ordinal or nominal), which is represented by the data you want to analyse. It is also connected with choosing the analysis model (dependent or independent). Measurements of a given feature are called dependent (paired) when the…
19. Results report:
Analysed variables: PCT, bacteremia
Significance level: 0.05
Grouping variable: sex (f, m)
Size: 75; STATE yes: 16; STATE no: 59
AUC (f): 0.864936441; SE(AUC): 0.079165996; -95% CI: 0.709773958
AUC (m): 0.911764706; SE(AUC): 0.059920399; -95% CI: 0.794322908; +95% CI: 1
|AUC1 - AUC2|: 0.046828265
SE(AUC1 - AUC2): 0.099285997
Z statistic: 0.47165025
p-value: 0.637176453
[Figure: ROC curves for PCT by sex; axes: sensitivity against 1 - specificity.]
The calculated areas are $AUC_f = 0.8649$ and $AUC_m = 0.9118$. Therefore, at the adopted level $\alpha = 0.05$ and given the obtained value $p = 0.6372$, we conclude that we cannot indicate the sex for which the PCT parameter is better for diagnosing bacteremia.
17 MULTIDIMENSIONAL MODELS
17.1 PREPARATION OF THE VARIABLES FOR THE ANALYSIS IN MULTIDIMENSIONAL MODELS
17.1.1 Variable coding in multidimensional models
When preparing data for a multidimensional analysis, nominal and ordinal variables have to be coded appropriately. This is an important part of preparing the data, because it is a key factor in the interpretation of the coefficients of a model. Nominal or ordinal variables divide the analysed objects into two or more categories. Dichotomous variables (two categories, $k = 2$) only need to be coded appropriately, whereas the variables with many categ…
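The last four rows of this report can be retraced from the two areas and their standard errors. The sketch below assumes that the groups (women and men) are independent, so the standard error of the difference is $\sqrt{SE_f^2 + SE_m^2}$, and that the p-value is two-sided. Both are assumptions checked only against the numbers printed above.

```python
import math

auc_f, se_f = 0.864936441, 0.079165996   # women
auc_m, se_m = 0.911764706, 0.059920399   # men

difference = abs(auc_m - auc_f)
se_difference = math.sqrt(se_f ** 2 + se_m ** 2)  # independent groups
z = difference / se_difference
p_value = math.erfc(z / math.sqrt(2))  # two-sided tail area of the normal distribution

print(round(difference, 9), round(se_difference, 9))
print(round(z, 8), round(p_value, 6))
```

Because the p-value is far above 0.05, the calculation leads to the same reading as the text: the data give no basis for preferring one sex over the other.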
20. (1) 150 minutes or less,
(2) more than 150 minutes,
(3) a number of minutes within the range $\bar{x} \pm sd$ = (148.12 min, 174.18 min),
(4) a number of minutes outside the range $\bar{x} \pm sd$.
Open the Probability distribution calculator window, select Gaussian distribution, enter the mean $\bar{x}$ = 161.15 min and the standard deviation $sd$ = 13.03 min, and select the option indicating that you are going to calculate the p-value.
(1) To calculate, using the normal (Gauss) distribution, the probability that the client you have chosen used 150 free minutes or less, put the value 150 in the Statistic field. Confirm all the selected settings by clicking Calculate.
[Figure: density of N(161.15, 13.03) with the area up to 150 marked.]
The obtained p-value is 0.193961.
Note: You can carry out similar calculations on the basis of the empirical distribution. All you need to do is calculate the percentage of clients who use 150 minutes or less (example 6.1) in the Frequency tables window. In the analysed sample of 200 clients, 40 clients use 150 minutes or less. That is 20% of the whole sample, so the probability you are looking for is $p = 0.2$.
(2) To calculate, using the normal (Gauss) distribution, the probability that the client you have chosen used more than 150 free minutes, put the value 150 in the Statistic field and…
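The four probabilities listed above can also be obtained outside the calculator window. The sketch below uses the normal cumulative distribution function written with the error function, together with the rounded mean and standard deviation quoted in the text; because those inputs are rounded, the first result may differ slightly in the third decimal place from the p-value shown by the program.

```python
import math


def normal_cdf(x, mean, sd):
    """P(X <= x) for a normal distribution with the given mean and sd."""
    return 0.5 * (1 + math.erf((x - mean) / (sd * math.sqrt(2))))


mean, sd = 161.15, 13.03

p_at_most_150 = normal_cdf(150, mean, sd)                # case (1)
p_more_than_150 = 1 - p_at_most_150                      # case (2)
p_within_one_sd = (normal_cdf(mean + sd, mean, sd)
                   - normal_cdf(mean - sd, mean, sd))    # case (3)
p_outside_one_sd = 1 - p_within_one_sd                   # case (4)

print(p_at_most_150, p_more_than_150)
print(p_within_one_sd, p_outside_one_sd)
```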
21. [16] Brown M.B., Forsythe A.B. (1974a), Robust tests for equality of variances. Journal of the American Statistical Association, 69, 364-367.
[17] Brown W. (1910), Some experimental results in the correlation of mental abilities. British Journal of Psychology, 3, 296-322.
[18] Clopper C. and Pearson S. (1934), The use of confidence or fiducial limits illustrated in the case of the binomial. Biometrika, 26, 404-413.
[19] Cochran W.G. (1950), The comparison of percentages in matched samples. Biometrika, 37, 256-266.
[20] Cochran W.G. (1952), The chi-square goodness-of-fit test. Annals of Mathematical Statistics, 23, 315-345.
[21] Cochran W.G. and Cox G.M. (1957), Experimental designs (2nd ed.). New York: John Wiley and Sons.
[22] Cohen J. (1960), A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 10, 37-46.
[23] Cox D.R. (1972), Regression models and life tables. Journal of the Royal Statistical Society, B34, 187-220.
[24] Cramér H. (1946), Mathematical models of statistics. Princeton, NJ: Princeton University Press.
[25] Cronbach L.J. (1951), Coefficient alpha and the internal structure of tests. Psychometrika, 16(3), 297-334.
[26] DeLong E.R., DeLong D.M., Clarke-Pearson D.L. (1988), Comparing the areas under two or more correlated receiver operating curves: A nonparametric approach. Biometrics, 44, 837-845.
[27] Fishe…
22. [Table: ROC analysis results for consecutive PCT cut-off values; the cell values are not recoverable from this excerpt.]
The calculated size of the area under the ROC curve is AUC = 0.889. Therefore, at the adopted level $\alpha = 0.05$ and given the obtained value $p < 0.000001$, we assume that diagnosing bacteremia with the PCT indicator is indeed more useful than a random distribution of patients into 2 groups: those suffering from bacteremia and those not suffering from it. We therefore return to the analysis to define the optimal cut-off.
The algorithm that searches for the optimal cut-off takes into account the costs of wrong decisions and the prevalence coefficient.
(1) FN cost (wrong diagnosis): the cost of assuming that the patient does not suffer from bacteremia although in reality he or she does (the cost of a false negative decision).
(2) FP cost wro…
23. … 47, 583-621.
[48] Lancaster H.O. (1961), Significance tests in discrete distributions. Journal of the American Statistical Association, 56, 223-234.
[49] Lee E.T., Wang J.W. (2003), Statistical Methods for Survival Data Analysis (3rd ed.). Wiley.
[50] Levene H. (1960), Robust tests for the equality of variance. In I. Olkin (Ed.), Contributions to probability and statistics (278-292). Palo Alto, CA: Stanford University Press.
[51] Lilliefors H.W. (1967), On the Kolmogorov-Smirnov test for normality with mean and variance unknown. Journal of the American Statistical Association, 62, 399-402.
[52] Lilliefors H.W. (1969), On the Kolmogorov-Smirnov test for the exponential distribution with mean unknown. Journal of the American Statistical Association, 64, 387-389.
[53] Lilliefors H.W. (1973), The Kolmogorov-Smirnov and other distance tests for the gamma distribution and for the extreme-value distribution when parameters must be estimated. Department of Statistics, George Washington University, unpublished manuscript.
[54] Lund R.E., Lund J.R. (1983), Algorithm AS 190: Probabilities and Upper Quantiles for the Studentized Range. Applied Statistics, 34.
[55] Mann H. and Whitney D. (1947), On a test of whether one of two random variables is stochastically larger than the other. Annals of Mathematical Statistics, 18, 504.
[56] Mantel N. and Haenszel W. (1959), Statistical aspects of the analysis of data from retrospective studies of disease. Journa…
24. [Screenshot: "Linear correlation (r-Pearson)" settings window with Variable 1 (X value) and Variable 2 (Y value), the Data Filter panel, and the Report options "Add analysed data" and "Add graph".]
EXAMPLE 14.1 (age-height.pqs file)
The dependence between age and height was analysed among students of a ballet school. The sample consists of 16 children, for whom the following values of these features were recorded (age, height):
(5, 128), (5, 129), (5, 135), (6, 132), (6, 137), (6, 140), (7, 148), (7, 150), (8, 135), (8, 142), (8, 151), (9, 138), (9, 153), (10, 159), (10, 160), (10, 162)
Hypotheses:
$H_0$: there is no linear dependence between age and height in the population of children who attend the analysed school,
$H_1$: there is a linear dependence between age and height in the population of children who attend the analysed school.
Results report (field names only; the values are not included in this excerpt): Analysis time; Analysed variables; Significance level; Size (number of pairs); Group name, Group mean, Group standard deviation (for each variable); The standard deviation of the residuals; r; r2; Std. err. of r; -95% CI for r coefficient; +95% CI for r coeffici…
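For readers who want to follow the example by hand, the sketch below computes the Pearson linear correlation coefficient and its square for the 16 pairs listed above, using the textbook definition $r = S_{xy} / \sqrt{S_{xx} S_{yy}}$. The report values are not shown in this excerpt, so the output is a hand check and not a quotation from the program.

```python
import math

age = [5, 5, 5, 6, 6, 6, 7, 7, 8, 8, 8, 9, 9, 10, 10, 10]
height = [128, 129, 135, 132, 137, 140, 148, 150,
          135, 142, 151, 138, 153, 159, 160, 162]

n = len(age)
mean_age = sum(age) / n
mean_height = sum(height) / n

# Sums of squares and cross-products around the means.
s_xy = sum((x - mean_age) * (y - mean_height) for x, y in zip(age, height))
s_xx = sum((x - mean_age) ** 2 for x in age)
s_yy = sum((y - mean_height) ** 2 for y in height)

r = s_xy / math.sqrt(s_xx * s_yy)
r_squared = r ** 2

print(n, mean_age, mean_height)
print(round(r, 4), round(r_squared, 4))
```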
25. [Screenshot: settings window with the Data Filter panel and the Report options "Add analysed data", "More results" and "Add graph".]
EXAMPLE 12.2 (chocolate bar.pqs file)
The quarterly sales of a certain chocolate bar were measured in 14 randomly chosen supermarkets. The study started in January and finished in December. During the second quarter a billboard campaign was in full swing. Let us check whether the campaign influenced the sales of the advertised chocolate bar.
Hypotheses:
$H_0$: there is no significant difference in sales values between the compared quarters in the population represented by the whole sample,
$H_1$: the difference in sales values between at least 2 quarters is significant in the population represented by the whole sample.
Results report (field names only; the values are not included in this excerpt): Analysis time; Analysed variables; Significance level; for each of the four groups: Group name, Group size, Sum of the ranks for group, Mean of the ranks for the group, Group median; Degrees of freedom; Chi-square statistic (adjusted for ties); p…
26. Hypotheses: $H_0: \theta_j = \theta_{j+1}$, $H_1: \theta_j \neq \theta_{j+1}$. (i) The value of the critical difference is calculated by using the following formula: $CD = Z_{\alpha/c}\sqrt{\frac{N(N+1)}{12}\sum_{j=1}^{k}\frac{c_j^2}{n_j}}$, where $Z_{\alpha/c}$ is the critical value (statistic) of the normal distribution for a given significance level $\alpha$ corrected for the number of possible simple comparisons $c$. (ii) The test statistic is defined by $Z = \frac{\sum_{j=1}^{k} c_j \bar{R}_j}{\sqrt{\frac{N(N+1)}{12}\sum_{j=1}^{k}\frac{c_j^2}{n_j}}}$, where $\bar{R}_j$ is the mean of the ranks of the j-th group, for j = 1, 2, ..., k. The test statistic asymptotically (for large sample sizes) has the normal distribution, and the p-value is corrected for the number of possible simple comparisons $c$. The settings window with the Kruskal-Wallis ANOVA can be opened in Statistics menu → NonParametric tests (ordered categories) → Kruskal-Wallis ANOVA or in Wizard. [Settings window: Kruskal-Wallis one-way analysis of variance. Variable: Degree of weed infestation; grouping variable: Field number; test options; data filter (variable, condition, value); report options: Add analysed data, More results, Add graph.] The Friedman repeated measures analysis of variance by ranks (the Friedman ANOVA) was described by Friedman (1937) [33]. This test is used when the measurements of an analy
27. Support UTF-8 character encoding; recommended for use on computers with a small amount of memory. 22.2 SETTINGS [Settings window options: Displayed values precision; Exponential notation for p-value; Report name in the navigation tree; Default significance level for testing; When the program starts, open; Support for multithreading; Decimal separator from the system settings; Maximum number of undo steps in sheet; The maximum number of cells to remember in one step; Check for updates when starting PQStat; Make text and other items of interface larger. Report name options: test name; test name, time; test name, description; test name, filter; test name, grouping variable; test name, variables.] REFERENCES [1] Abdi H. (2007), Bonferroni and Sidak corrections for multiple comparisons, in N.J. Salkind (ed.), Encyclopedia of Measurement and Statistics. Thousand Oaks, CA: Sage. [2] Agresti A., Coull B.A. (1998), Approximate is better than exact for interval estimation of binomial proportions. The American Statistician 52: 119-126. [3] Altman D.G., Bland J.M. (1983), Measurement in medicine: the analysis of method comparison studies. The Statistician 32: 307-317. [4] Anscombe F.J. (1981), Computing in Statistical Science through APL. Springer-Verlag, New York. [5] Armitage P., Berry G. (1
28. als distribution and the normal distribution (the value p ≤ α) can impair the evaluation of the significance of the coefficients of particular variables in the model. Homoscedasticity (homogeneity of variance): to check if there are areas in which the variance of the model residuals is increased or decreased, we use the charts of: the residual with respect to predicted values; the square of the residual with respect to predicted values; the residual with respect to observed values; the square of the residual with respect to observed values. Autocorrelation of model residuals: for the constructed model to be deemed correct, the values of the residuals should not be correlated with one another (for all pairs $e_i$, $e_j$). The assumption can be checked by computing the Durbin-Watson statistic $d = \frac{\sum_{t=2}^{n}(e_t - e_{t-1})^2}{\sum_{t=1}^{n} e_t^2}$. To test for positive autocorrelation at the significance level α, we check the position of the statistic d with respect to the upper ($d_{U,\alpha}$) and lower ($d_{L,\alpha}$) critical values: if $d < d_{L,\alpha}$, the errors are positively correlated; if $d > d_{U,\alpha}$, the errors are not positively correlated; if $d_{L,\alpha} < d < d_{U,\alpha}$, the test result is ambiguous. To test for negative autocorrelation at the significance level α, we check the position of the value 4 - d with respect to the upper ($d_{U,\alpha}$) and lower ($d_{L,\alpha}$) critical values: if $4 - d < d_{L,\alpha}$, the errors are negatively correlated
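The Durbin-Watson check described above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the residuals are given as a plain list in observation order, and the critical values `d_lower` and `d_upper` are supplied by the user from published tables (they are not computed here). The function names are hypothetical.

```python
def durbin_watson(residuals):
    """d = sum_{t=2..n} (e_t - e_{t-1})^2 / sum_{t=1..n} e_t^2."""
    numerator = sum(
        (residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals))
    )
    denominator = sum(e * e for e in residuals)
    return numerator / denominator


def positive_autocorrelation(d, d_lower, d_upper):
    """Decision rule for positive autocorrelation at a chosen alpha.

    d_lower and d_upper are tabulated critical values (assumed inputs).
    """
    if d < d_lower:
        return "positively correlated"
    if d > d_upper:
        return "not positively correlated"
    return "ambiguous"


# Strictly alternating residuals give a large d; constant residuals give d = 0.
d_alternating = durbin_watson([1.0, -1.0, 1.0, -1.0])
d_constant = durbin_watson([1.0, 1.0, 1.0, 1.0])
print(d_alternating, d_constant)
```

For negative autocorrelation the same decision function can be applied to the value 4 - d instead of d.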
29. by Agresti and Coull (1998) [2]. Clopper and Pearson (1934) [18] intervals are more adequate for small sample sizes. A comparison of interval estimation methods for a binomial proportion was published by Brown L.D. et al. (2001) [15]. The settings window with the Z test for one proportion can be opened in Statistics menu → NonParametric tests (unordered categories) → Z for proportion. [Settings window: Z test for one proportion; report options: Add analysed data, Add graph.] EXAMPLE 10.2 (cont.) (dinners.pqs file) Assume that you would like to check whether 1/5 of all the dinners served during the whole week are served on Friday. For the chosen sample m = 20, n = 150. Data (n, expected probability, day of the week): 150, 0.2, Monday; 150, 0.2, Tuesday; 150, 0.2, Wednesday; 150, 0.2, Thursday; 150, 0.2, Friday. Select the options of the analysis and activate a filter selecting the appropriate day of the week (Friday). If you do not activate the filter, no error will be generated; only statistics for the given weekdays will be calculated. Hypotheses: $H_0$: on Friday, 1/5 of all the dinners served within a week are served in the school canteen; $H_1$: on Friday in a school canteen there are significantly mo
30. distribution. Binomial test for one proportion. The binomial test for one proportion uses the binomial distribution directly. This distribution, also called the Bernoulli distribution, belongs to the group of discrete distributions (distributions in which the analysed variable takes a finite number of values). The analysed variable can take k = 2 values. The first one is usually called a success and the other one a failure. The probability of occurrence of a success (the distinguished probability) is $p_0$, and of a failure $1 - p_0$. The probability for a specific point of this distribution is calculated using the formula $P(m) = \binom{n}{m} p_0^m (1 - p_0)^{n-m}$, where $\binom{n}{m} = \frac{n!}{m!(n-m)!}$, m is the frequency of the distinguished values in the sample and n is the sample size. Based on the total of the appropriate probabilities P, a one-sided and a two-sided p-value are calculated; the two-sided p-value is defined as the doubled value of the smaller of the one-sided probabilities. The p-value is compared with the significance level α: if p ≤ α → reject $H_0$ and accept $H_1$; if p > α → there is no reason to reject $H_0$. Note: for the estimator from the sample, which in this case is the value of the proportion p, a confidence interval is calculated. For a large sample size the interval can be based on the normal distribution (so-called Wald intervals). More universal are the intervals proposed by Wilson (1927) [86] and
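The binomial probabilities and the p-value rule described above can be sketched as follows. This is an illustration rather than the program's own code: the function names are hypothetical, the values m = 20, n = 150 and p0 = 0.2 are used only as sample inputs, and capping the doubled probability at 1 is an added assumption.

```python
from math import comb


def binom_pmf(m, n, p0):
    """P(m) = C(n, m) * p0^m * (1 - p0)^(n - m)."""
    return comb(n, m) * p0 ** m * (1 - p0) ** (n - m)


def binomial_test(m, n, p0):
    """Return (lower one-sided, upper one-sided, two-sided) p-values.

    The two-sided value is the doubled smaller one-sided probability,
    capped at 1 (the cap is an assumption of this sketch).
    """
    lower = sum(binom_pmf(k, n, p0) for k in range(0, m + 1))
    upper = sum(binom_pmf(k, n, p0) for k in range(m, n + 1))
    two_sided = min(1.0, 2 * min(lower, upper))
    return lower, upper, two_sided


lower, upper, two_sided = binomial_test(20, 150, 0.2)
print(f"P(X <= 20) = {lower:.4f}, P(X >= 20) = {upper:.4f}, two-sided = {two_sided:.4f}")
```

Both one-sided sums include the observed point m, so they add up to slightly more than 1.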
31. gt P2 Wo pees Pr W The test statistic is defined by 2 1 9 fon em a 2 He This statistic asymptotically (for large expected frequencies) has the $\chi^2$ distribution with 1 degree of freedom. The p-value, designated on the basis of the test statistic, is compared with the significance level α: if p ≤ α → reject $H_0$ and accept $H_1$; if p > α → there is no reason to reject $H_0$. The settings window with the Chi-square test for trend can be opened in Statistics menu → NonParametric tests (ordered categories) → Chi-square for trend or in Wizard. [Settings window: Chi-square test for trend. Variable: level of commitment; data filter (basic or multiple; all the rules are combined using the logical AND); report options: Add analysed data, Add graph, Add percentages.] EXAMPLE 11.5 (viewers.pqs file) Because the number of people watching a particular soap opera was decreasing, an opinion survey was carried out. 100 people who had recently started watching the soap opera and 300 people who had watched it regularly from the beginning were asked about their level of preoccupation with the characters' lives. The results are written down i
32. if the differences between means < CD → there is no reason to reject $H_0$. (ii) Comparing the p-value, designated on the basis of the test statistic of the proper POST-HOC test, with the significance level α: if p ≤ α → reject $H_0$ and accept $H_1$; if p > α → there is no reason to reject $H_0$. The LSD Fisher test. For simple and complex comparisons, equal-size groups as well as unequal-size groups. (i) The value of the critical difference is calculated by using the following formula: $CD = \sqrt{F_{\alpha,1,df_{WG}}}\cdot\sqrt{\left(\sum_{j=1}^{k}\frac{c_j^2}{n_j}\right) MS_{WG}}$, where $F_{\alpha,1,df_{WG}}$ is the critical value (statistic) of the F Snedecor distribution for a given significance level α and degrees of freedom 1 and $df_{WG}$, respectively. (ii) The test statistic is defined by $t = \frac{\sum_{j=1}^{k} c_j \bar{x}_j}{\sqrt{\left(\sum_{j=1}^{k}\frac{c_j^2}{n_j}\right) MS_{WG}}}$. The test statistic has the t-Student distribution with $df_{WG}$ degrees of freedom. The Scheffe test. For simple comparisons, equal-size groups as well as unequal-size groups. (i) The value of the critical difference is calculated by using the following formula: $CD = \sqrt{F_{\alpha,df_{BG},df_{WG}}}\cdot\sqrt{(k-1)\left(\sum_{j=1}^{k}\frac{c_j^2}{n_j}\right) MS_{WG}}$, where $F_{\alpha,df_{BG},df_{WG}}$ is the critical value (statistic) of the F Snedecor distribution for a given significance level α and $df_{BG}$ and $df_{WG}$ degrees of freedom. (ii) The test statistic is defined by $F = \frac{\left(\sum_{j=1}^{k} c_j \bar{x}_j\right)^2}{(k-1)\left(\sum_{j=1}^{k}\frac{c_j^2}{n_j}\right) MS_{WG}}$. The test statistic has the F Snedecor distribution with $df_{BG}$ and
33. series are read on the upper X axis and the right Y axis. The interpretation of the vectors (the first series) was discussed with the previous graph. To understand the interpretation of the points, let us focus on flowers number 33, 34, and 109. Flowers number 33 and 34 are similar: the distance between points 33 and 34 is small. For both points the value of the first component is much greater than the average and the value of the second component is much smaller than the average. The average value (i.e. the arithmetic mean) of both components is 0, i.e. it is the middle of the coordinate system. Remembering that the first component is mainly the size of the petals and the second one is mainly the width of the sepal, we can say that flowers number 33 and 34 have small petals and a large width of the sepal. Flower number 109 is represented by a point which is at a large distance from the other two points. It is a flower with a negative first component and a positive, although not high, second component. That means the flower has relatively large petals, while the width of the sepal is a bit smaller than average. Similar information can be gathered by projecting the points onto the lines which extend the vectors of the original values. For example, flower 33 has a large width of the sepal (high and positive values on the project
34. The most similar representatives are no. 1 and no. 2, and the least similar ones are no. 1 and no. 3. Jaccard's similarity of representative 1 and representative 2 is 0.857143, which means that the 2 species share a little over 85%. Jaccard's similarity of representative 1 and representative 3 is 0.375, which means that the 2 species share above 37%. Jaccard's similarity of representative 2 and representative 3 is 0.428571, which means that the 2 species share above 43%. Similarity matrix options are used for selecting the manner in which the elements of the matrix ought to be returned. By default, all elements of the matrix are returned in the form in which they have been calculated according to the accepted metric. We can change it by setting Matrix elements: minimum means that in each row of the matrix only the minimum value and the value on the main diagonal will be displayed; maximum means that in each row of the matrix only the maximum value and the value on the main diagonal will be displayed; k of the minimum means that in each row of the matrix as many smallest values will be displayed as indicated by the user (who gives the k value), and the value on the main diagonal; k of the maximum means that in each row of the matrix as many greatest values will be displayed as indicated by the user (who gives the k value), and the value on the main diagonal; elements below d means that in each row of the matrix only those elements will
35. 0 922483 0 850974 0 829059 3 prod_c advwert_c popular_au t stat 0 874791 3 106033 2 902106 0 974245 1 013069 3 649047 p value 0 387845 0 000013 0 000003 0 336816 0 318183 0 000874 8 072536 0 918016 0 842753 0 629649 0 937692 b stand b stand er 0 422818 0 461287 0 066242 0 067478 0 261561 0 082807 0 082669 0 067993 0 066606 0 071679 Model coefficients (b; b error; 95% CI; t stat; p-value; b stand.; b stand. err.): intercept: b error 3.277567; 95% CI 2.205704 to 15.498215; t stat 2.70045; p-value 0.010466. prod_c: 2.519802; 0.492761; 1.520396 to 3.519209; 5.113431; 0.000011; 0.416063; 0.081367. advert_c: 2.074037; 0.350655; 1.362876 to 2.785198; 5.914752; 0.000001; 0.478786; 0.080948. popular_at: 9.921455; 2.771625; 4.30034 to 15.54257; 3.579655; 0.001007; 0.255578; 0.071397. It turns out that there is no basis for thinking that the full model is better than the reduced model (the p-value of the F test used for comparing the models is p = 0.401345). Additionally, the reduced model is slightly more adequate than the full model (for the reduced model $R^2_{adj}$ = 0.82964880, for the full model $R^2_{adj}$ = 0.82905898). Automatic model comparison: in the case of automatic model comparison we receive very similar results. The best model is the one with the greatest coefficient $R^2_{adj}$ and the smallest standard estimation error $SE_e$. The best model we suggest is the model containing only 3 independent variables: the produ
36. The settings window with the t-test for independent groups can be opened in Statistics menu → Parametric tests → t-test for independent groups or in Wizard. [Settings window: t-test for independent groups. Variable; data filter (basic or multiple; all the rules are combined using the logical AND); variances: Equal, Different, Check equality; report options: Add analysed data, Add graph.] If, in the window which contains the options related to the variances, you have chosen: equal, the t-test for independent groups will be calculated; different, the t-test with the Cochran-Cox adjustment will be calculated; check equality, the Fisher-Snedecor test will be calculated and, based on its result and the set level of significance, the t-test for independent groups with or without the Cochran-Cox adjustment will be calculated. Note: calculations can be based on raw data or on averaged data (arithmetic means, standard deviations, and sample sizes). EXAMPLE 11.1 (age.pqs file) In an experiment, 100 people were chosen randomly from the population of workers of 2 different transport companies, 50 people from each company. Before the experiment begins you should check if the
37. 1901), where 1 means the presence of a given characteristic and 0 means the absence of it. [Table: 2 x 2 cross-classification of object 1 and object 2 by values 1 and 0, with cell counts a, b, c, d.] The Jaccard distance is expressed with the formula $d(x_1, x_2) = 1 - J$, where $J = \frac{a}{a + b + c}$ is Jaccard's similarity coefficient. Jaccard's similarity coefficient is within the range [0, 1], where 1 means the highest and 0 the lowest similarity. The distance (dissimilarity) is interpreted in the opposite manner: 1 means that the compared objects are dissimilar and 0 that they are very similar. The meaning of Jaccard's similarity coefficient can be illustrated very well by the situation of clients choosing products. The purchase of a given product by a client will be marked with 1, and not purchasing the product with 0. When calculating Jaccard's coefficient we will compare 2 products so as to learn how many clients buy them together. We are not, of course, interested in the clients who did not buy any of the compared products. What we are interested in is how many people who bought one of the compared products also bought the other one. The sum a + b + c is the number of clients who bought one of the compared products, and a is the number of clients who bought both products. The higher the coefficient, the more interrelated the purchases: the purchase of one product is accompanied by the purchase of the other one. The opposite is true if we obtain a high Jaccard's dissimilarity coefficient. Suc
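The client and product illustration above translates directly into code. The sketch below is a minimal example with made-up purchase vectors (1 = client bought the product, 0 = did not); the function names and data are hypothetical. Clients who bought neither product (the count d) are ignored, as the text describes.

```python
def jaccard_similarity(x1, x2):
    """J = a / (a + b + c) for two equal-length 0/1 vectors.

    a counts positions where both are 1; b + c counts positions where
    exactly one is 1; positions where both are 0 are ignored.
    """
    a = sum(1 for u, v in zip(x1, x2) if u == 1 and v == 1)
    b_plus_c = sum(1 for u, v in zip(x1, x2) if u != v)
    if a + b_plus_c == 0:
        raise ValueError("similarity is undefined when both vectors are all zeros")
    return a / (a + b_plus_c)


def jaccard_distance(x1, x2):
    """d(x1, x2) = 1 - J."""
    return 1 - jaccard_similarity(x1, x2)


# Hypothetical purchases of two products by eight clients.
product_1 = [1, 1, 1, 1, 1, 1, 1, 0]
product_2 = [1, 1, 1, 1, 1, 1, 0, 0]

similarity = jaccard_similarity(product_1, product_2)
distance = jaccard_distance(product_1, product_2)
print(f"J = {similarity:.6f}, d = {distance:.6f}")
```

Here six clients bought both products and one bought only the first, so J = 6/7.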
38. [Table: Cox regression results. Columns: B coeff.; B error; 95% CI (lower, upper); Wald stat.; p-value; Hazard ratio; 95% CI (lower, upper). log WBC: 1.6389171; 0.5190378; 0.6216215, 2.6562126; 9.9704762; 0.0015907; 5.1495899; 1.8619449, 14.242245. Rx: 1.6590474; 0.7291016; 0.43500345, 3.2660603; 6.5013690; 0.0107791; 6.4176205; 1.5373105, 276.790850.] 20 RELIABILITY ANALYSIS Reliability analysis is usually associated with the construction of complex scales, in particular summary scales (these consist of many individual items). Reliability, understood as internal consistency, informs us to what extent a particular scale measures what it should measure, in other words, to what extent the scale items measure the things that are measured by the whole scale. When every scale item measures the same construct (the correlation between the items should be high), we can call the scale reliable. This assumption can be checked by calculating the matrix of Pearson's correlation coefficients. Many measures of concordance can be used in reliability analysis. However, the most popular techniques are Cronbach's α coefficient and the so-called split-half reliability. Cronbach's α coefficient was named for the first time in 1951 [25] by Cronbach. It measures the proportion of the single item variances and the whole scale variance (items sum). It is calculated accor
39. Increment: a value which is to be the difference between consecutive generated data. To generate random numbers: Lower limit: beginning of the interval from which the values will be randomised; Upper limit: end of the interval from which the values will be randomised. To generate random values from a distribution, choose the kind of distribution (Normal distribution, Chi-square distribution) and then enter its parameters. The amount of generated data depends on the value you put in the Count field, and the precision depends on the setting of the Decimal places field. Data will be placed upwards or downwards starting with the active cell, depending on the selected option. At the end, confirm your choice by clicking Run. 3.1.11 MISSING DATA In studies we very often see missing data. That is especially to be expected in the case of survey data. There are situations in which the missing data give valuable information. For example, the number of missing data in answers to a question concerning preferences with regard to political parties informs us about the number of undecided citizens who do not favor (or do not admit that they favor) particular political groups. Small amounts of missing data do not constitute a problem in statistical analyses. Large amounts, however, can undermine the reliability of the conducted research. It is worth taking care that there are as few such gaps as possible from th
40. 932064 1 101056 0 357743 2 009229 1 961266 0 662662 1 343489 2 221106 0 950314 1 002641 1 652324 1 174161 7 69222 9 956516 1 014992 The goodness of fit of the model is not high ($R^2$ measures of 0.11 and 0.14). At the same time, the model is statistically significant (value p < 0.000001 of the Likelihood Ratio test), which means that a part of the independent variables in the model is statistically significant. The result of the Hosmer-Lemeshow test points to a lack of significance (p = 0.2753). However, in the case of the Hosmer-Lemeshow test we ought to remember that a lack of significance is desired, as it indicates a similarity of the observed sizes and of the predicted probability. An interpretation of particular variables in the model starts from checking their significance. In this case the variables which are significantly related to the occurrence of the anomaly are: Sex (p = 0.0063), BirthWeight (p = 0.0188), PregNo (p = 0.0035), RespTinf (p < 0.000001), Smoking (p = 0.0003). The studied congenital anomaly is a rare anomaly, but the odds of its occurrence depend on the variables listed above in the manner described by the odds ratio: variable Sex: OR [95% CI] = 1.60 [1.14; 2.22], the odds of the occurrence of the anomaly in a boy are 1.6 times greater than in a girl; variable BirthWeight: OR [95% CI] = 0.74 [0.57; 0
41. B a 2 Var s 7 yo Vare where E S is solution to the quadratic equation E Of 09 Eo yo p AL A EC Op a a MM ORwH Ol 09 BO off off BO 1 Var ah a t e t oct Bo oF olyra oo orogr This statistic asymptotically (for large frequencies) has the $\chi^2$ distribution with the number of degrees of freedom calculated using the formula df = w - 1. The p-value, designated on the basis of the test statistic, is compared with the significance level α: if p ≤ α → reject $H_0$ and accept $H_1$; if p > α → there is no reason to reject $H_0$. EXAMPLE 13.1 (leptospirosis.pqs file) The following table presents hypothetical results of a poll conducted among inhabitants of a city and a village (the village is treated as a risk factor) in West India. The aim of the poll was to detect risk factors of leptospirosis [9]. The occurrence of leptospirosis antibodies is indirect evidence of infection. Observed frequencies (leptospirosis antibodies by place of residence): rural 60, 240; urban 60, 140. The odds of the occurrence of leptospirosis antibodies among inhabitants of the city and of the village are the same (OR = 1). Let us include gender in the analysis and check what the odds will be then. The sample has to be divided into 2 strata by gender (they are marked in the file as a saved selection). leptospirosis antibo
42. Clear selections. You can use a data filter. The data filter is an option which is available when you choose any statistical analysis. If you turn the filter on, the number of rows that are taken into account during the analysis is reduced. There are 2 possible filters: the basic filter and the multiple filter. Basic filter: uses one or more rules joined with a conjunction (AND) or an alternative (OR). EXAMPLE 4.6 Basic filter (filter.pqs file) You want to calculate descriptive statistics for the height of girls who are between 10 and 15 years old. Choose Descriptive statistics from the Statistics menu. In the descriptive statistics options window, select all the procedures you want to have done (for example mean, standard deviation, minimum, and maximum) and the variable for analysis (the column which includes height). To set the filter, you need to add rules using the add-rule button. First, set the rule for the variable sex: choose the equal sign as the condition and the letter g (which means girls) as the value. After that, add another rule and set the variable age, then the > sign as the condition and 10 as the value. In exactly the same way, add the age condition < 15. Note: to do this task properly, all the rules of the filter should be joined with a conjunction (the AND sign informs you about it). If you have selected the analysis conditions properly, confirm your choice by clicking OK. [Settings window: Descriptive statistics, multiple.] Statistical analysis Des
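The logic of the basic filter in EXAMPLE 4.6 (three rules joined with AND, then descriptive statistics on the remaining rows) can be mimicked outside the program. The sketch below is only an illustration of that logic: the rows are invented, and the names `OPS`, `apply_filter`, `ROWS` and `RULES` are hypothetical, not part of PQStat.

```python
import operator
import statistics

# Conditions supported by this sketch: equal, greater than, less than.
OPS = {"=": operator.eq, ">": operator.gt, "<": operator.lt}


def apply_filter(rows, rules):
    """Keep the rows that satisfy every rule (rules are joined with AND)."""
    return [
        row
        for row in rows
        if all(OPS[condition](row[variable], value) for variable, condition, value in rules)
    ]


# Invented sample rows; the real example uses the filter.pqs datasheet.
ROWS = [
    {"sex": "g", "age": 9, "height": 131},
    {"sex": "g", "age": 11, "height": 142},
    {"sex": "b", "age": 12, "height": 150},
    {"sex": "g", "age": 14, "height": 158},
    {"sex": "g", "age": 15, "height": 161},
]

# sex = g AND age > 10 AND age < 15, as in the example.
RULES = [("sex", "=", "g"), ("age", ">", 10), ("age", "<", 15)]

heights = [row["height"] for row in apply_filter(ROWS, RULES)]
summary = {
    "mean": statistics.mean(heights),
    "standard deviation": statistics.stdev(heights),
    "minimum": min(heights),
    "maximum": max(heights),
}
print(heights, summary)
```

With strict inequalities the girls aged exactly 10 or 15 are excluded, which is how the rules are stated in the example.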
43. Ctrl+Down on the name of the report in the Navigation tree. Adding information to the report name in the Navigation tree (such as the hour of generation, description, filter, the name of the grouping variable, the name of the variable) is possible after selecting the appropriate option in the program settings window. 3.3 HOW TO CHANGE LANGUAGE SETTINGS IN PQSTAT Both the created reports and the program interface can be switched between Polish and English. To change the language, click Edition → Language/Język. Reports opened after the switch will be translated automatically, except for the procedure name, which is a description and can be edited by the user. 3.4 MENU File menu: New project (Ctrl+N); Add datasheet (Ctrl+D); Open project (Ctrl+O); Open recent; Open examples; Import from; Save (Ctrl+S); Save as; Close project; Print; Close (Ctrl+Q), to close the program. Edit menu: Undo (Ctrl+Z); Cut (Ctrl+X); Copy (Ctrl+C); Paste (Ctrl+V); Delete (Del); Select all (Ctrl+A); Find/Replace (Ctrl+F); Column format (Ctrl+F10); Activate/Deactivate filter; Activate all; Save selection (Ctrl+T); Clear selections; Language/Język; Settings. Data menu: Create table; Create raw data; Sort; Formulas; Generate; Missin
44. [Settings window: data filter (basic or multiple; all the rules are combined using the logical AND); report options: Add analysed data, Add graph; kind of comparison: manual; significance.] Due to the possibility of simultaneous analysis of many independent variables in one logistic regression model, similarly to the case of multiple linear regression, there is a problem of selecting an optimum model. When choosing independent variables, one has to remember to put into the model variables strongly correlated with the dependent variable and weakly correlated with one another. When comparing models with various numbers of independent variables, we pay attention to the goodness of fit of the model ($R^2_{Pseudo}$, $R^2_{Nagelkerke}$, $R^2_{Cox-Snell}$). For each model we also calculate the maximum of the likelihood function, which we later compare with the use of the Likelihood Ratio test. Hypotheses: $H_0: L_{FM} = L_{RM}$, $H_1: L_{FM} \neq L_{RM}$, where $L_{FM}$, $L_{RM}$ are the maxima of the likelihood function in the compared models (full and reduced). The test statistic has the form presented below: $\chi^2 = -2\ln(L_{RM}/L_{FM}) = -2\ln L_{RM} - (-2\ln L_{FM})$. The statistic asymptotically (for large sizes) has the $\chi^2$ distribution with $df = k_{FM} - k_{RM}$ degrees of freedom, where $k_{FM}$ and $k_{RM}$ are the numbers of estimated parameters in the compared models. On the basis of te
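The Likelihood Ratio statistic for comparing a full and a reduced model is simple arithmetic once the maximised log-likelihoods are known. The following sketch assumes those log-likelihoods come from some fitting routine; the numeric values and the function name are invented for illustration, and the p-value lookup in the chi-square distribution is left out.

```python
def likelihood_ratio_statistic(loglik_full, loglik_reduced, k_full, k_reduced):
    """Return (chi-square statistic, degrees of freedom).

    chi2 = -2 ln(L_RM) - (-2 ln(L_FM)), df = k_FM - k_RM, where the
    arguments are the maximised log-likelihoods ln(L) of both models.
    """
    statistic = -2 * loglik_reduced - (-2 * loglik_full)
    df = k_full - k_reduced
    return statistic, df


# Invented values: a full model with 6 parameters and a reduced one with 4.
statistic, df = likelihood_ratio_statistic(
    loglik_full=-100.0, loglik_reduced=-105.0, k_full=6, k_reduced=4
)
print(f"chi-square = {statistic:.3f}, df = {df}")
```

Because the reduced model is nested in the full one, its log-likelihood cannot exceed that of the full model, so the statistic is never negative.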
45. Haenszel (1959) [56], Mantel (1966) [58], Cox (1972) [23]); Gehan's generalization of Wilcoxon's test, deriving from Wilcoxon's test (Breslow (1970), Gehan (1965) [34][35]); the Tarone-Ware test, deriving from Wilcoxon's test (Tarone and Ware (1977) [76]). The three tests are based on the same test statistic; they only differ in the weights $w_i$ of the particular points of the timeline on which the test statistic is based. Log-rank test: $w_i = 1$, all the points of the timeline have the same weight, which gives the later values of the timeline a greater influence on the result. Gehan's generalization of Wilcoxon's test: $w_i = n_i$, time moments are weighted with the number of observations in each of them, so greater weights are ascribed to the initial values of the timeline. Tarone-Ware test: $w_i = \sqrt{n_i}$, time moments are weighted with the root of the number of observations in each of them, so the test is situated between the two tests described earlier. An important condition for using the tests above is the proportionality of hazard. Hazard, defined as the slope of the survival curve, is the measure of how quickly a failure event takes place. Breaking the principle of hazard proportionality does not completely disqualify the tests above, but it carries some risks. First of all, the placement of the point of intersection of the curves with respect to the timeline has a decisive influence on decreasing the power of particular tests.
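The difference between the three tests comes down to the weight given to each point of the timeline. The sketch below only computes those weights, not the full test statistic; it assumes that "the number of observations" at each time point is supplied as a list, and the function and test names are hypothetical labels.

```python
import math


def time_point_weights(n_observations, test):
    """Weights w_i for each time point, given the observation counts n_i.

    test: "log-rank" (w_i = 1), "gehan" (w_i = n_i),
    or "tarone-ware" (w_i = sqrt(n_i)).
    """
    if test == "log-rank":
        return [1.0 for _ in n_observations]
    if test == "gehan":
        return [float(n) for n in n_observations]
    if test == "tarone-ware":
        return [math.sqrt(n) for n in n_observations]
    raise ValueError(f"unknown test: {test}")


# Invented counts that shrink along the timeline, as they do in survival data.
counts = [100, 64, 25, 9]
for name in ("log-rank", "gehan", "tarone-ware"):
    print(name, time_point_weights(counts, name))
```

With shrinking counts, Gehan's weights fall fastest along the timeline, the log-rank weights stay flat, and the Tarone-Ware weights lie in between, which matches the description above.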
46. Lilliefors test or Kolmogorov-Smirnov test can be opened in Statistics menu → NonParametric tests (ordered categories) or in Wizard. [Settings window: Kolmogorov-Smirnov test. Variable; test options: Calculate mean and standard deviation, Hypothetical standard deviation; data filter (basic or multiple; all the rules are combined using the logical AND); report options: Add analysed data, Add graph.] [Settings window: Lilliefors test. Variable; data filter (basic or multiple); report options: Add analysed data, Add graph.] EXAMPLE 10.1 (continuation) (courier.pqs file) Hypotheses: $H_0$: the distribution of the number of days of waiting for a delivery by the analysed courier company is the normal distribution; $H_1$: the distribution of the number of days of waiting for a delivery by the analysed courier company is different from the normal distribution. The mean value and the standard deviation of the waiting time for the delivery for all the clients are not known
47. [Table: eigenvectors (components 1 to 4). Petal Len: 0.58041, 0.02449, 0.142121, 0.80144. Petal Wid: 0.56485, 0.06694, 0.63427, 0.523597.] [Table: factor loadings (components 1 to 4). Sepal Len: 0.89016, 0.36083, 0.275658, 0.037606. Sepal Wid: 0.460143, 0.88271, 0.093602, 0.017771. Petal Len: 0.99155, 0.02341, 0.05444, 0.11535. Petal Wid: 0.96497, 0.064, 0.24298, 0.07536.] [Table: variable contributions in % (components 1 to 4). Sepal Len: 27.15096, 14.24440, 51.77757, 6.827052. Sepal Wid: 7.254804, 85.24748, 5.972245, 1.525463. Petal Len: 33.68793, 0.059984, 2.01999, 64.23208. Petal Wid: 31.90629, 0.448123, 40.23019, 27.41539.] Particular original variables have differing effects on the first principal component. Let us put them in order according to that influence: 1. The length of a petal is negatively correlated with the first component, i.e. the longer the petal, the lower the values of that component. The eigenvector of the length of the petal is the greatest in that component and equals 0.58. Its factor loading informs us that the correlation between the first principal component and the length of the petal is very high and equals 0.99, which constitutes 33.69% of the first component. 2. The width of the petal has an only slightly smaller influence on the first component and is also negatively correlated with it. 3. We interpret the length of the sepal similarly to the two previous variables, but its influence on the first component is smaller. 4. The correlation of the width of the sepal and the first component is the weakest, and the sign of that correlation is positive. The secon
48. The analysed feature can have only 2 categories (defined here as (+) and (-)). The McNemar test can be calculated on the basis of raw data or on the basis of a 2 x 2 contingency table. Table 11.5. 2 x 2 contingency table for the observed frequencies of dependent variables. Observed frequencies: $O_{11}$, $O_{12}$ (first row); $O_{21}$, $O_{22}$ (second row); $n = O_{11} + O_{12} + O_{21} + O_{22}$. Hypotheses: $H_0: O_{12} = O_{21}$, $H_1: O_{12} \neq O_{21}$. The test statistic is defined by $\chi^2 = \frac{(O_{12} - O_{21})^2}{O_{12} + O_{21}}$. This statistic asymptotically (for large frequencies) has the $\chi^2$ distribution with 1 degree of freedom. The p-value, designated on the basis of the test statistic, is compared with the significance level α: if p ≤ α → reject $H_0$ and accept $H_1$; if p > α → there is no reason to reject $H_0$. The continuity correction for the McNemar test. This correction is a more conservative test than the McNemar test (a null hypothesis is rejected much more rarely than when using the McNemar test). It guarantees the possibility of taking in all the values of real numbers by the test statistic, according to the $\chi^2$ distribution assumption. Some sources state that the continuity correction should always be used, while others state that it should be used only if the frequencies in the table are small. The test statistic with the continuity correction is defined by $\chi^2 = \frac{(|O_{12} - O_{21}| - 1)^2}{O_{12} + O_{21}}$. Odds ratio of a result change. If the study is carried out twice for the sa
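Both versions of the McNemar statistic use only the two discordant cells of the table. The sketch below is a minimal illustration with invented counts; the function name is hypothetical. The p-value uses the identity that, for 1 degree of freedom, the upper tail of the chi-square distribution equals erfc(sqrt(x / 2)). Flooring the corrected difference at 0 is an added assumption of this sketch for the case when the two cells differ by less than 1.

```python
import math


def mcnemar(o12, o21, continuity=False):
    """McNemar chi-square statistic and its p-value (1 degree of freedom).

    o12 and o21 are the discordant cells of the 2 x 2 table.
    """
    if o12 + o21 == 0:
        raise ValueError("the statistic is undefined when O12 + O21 = 0")
    difference = abs(o12 - o21)
    if continuity:
        # Assumption: the corrected difference is not allowed to go negative.
        difference = max(difference - 1, 0)
    statistic = difference ** 2 / (o12 + o21)
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


# Invented discordant counts.
stat_plain, p_plain = mcnemar(15, 5)
stat_corrected, p_corrected = mcnemar(15, 5, continuity=True)
print(f"without correction: chi2 = {stat_plain:.3f}, p = {p_plain:.4f}")
print(f"with correction:    chi2 = {stat_corrected:.3f}, p = {p_corrected:.4f}")
```

The corrected statistic is never larger than the uncorrected one, which is why the corrected test rejects the null hypothesis less often.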
49. The value of the test statistic cannot be calculated when $r_p = 1$ or $r_p = -1$, or when n < 3. The test statistic has the t-Student distribution with n - 2 degrees of freedom. The p-value, designated on the basis of the test statistic, is compared with the significance level α: if p ≤ α → reject $H_0$ and accept $H_1$; if p > α → there is no reason to reject $H_0$. Prediction is used to predict the value of one variable (mainly a dependent variable $y_0$) on the basis of a value of another variable (mainly an independent variable $x_0$). The accuracy of a calculated value is defined by the prediction intervals calculated for it. Interpolation is used to predict the value of a variable which occurs inside the area for which the regression model was built. Interpolation is mainly a safe procedure: only the continuity of the function of the analysed variables is assumed. Extrapolation is used to predict the value of a variable which occurs outside the area for which the regression model was built. As opposed to interpolation, extrapolation is often risky and is performed only not far away from the area where the regression model was created. As with interpolation, the continuity of the function of the analysed variables is assumed. The settings window with Pearson's linear correlation can be opened in Statistics menu → Parametric tests → linear correlation (r-Pearson) or in Wizard.
50. a growth trend of the hazard value (trend in the position of the survival rates) will be found.

Trend for many survival curves
If we introduce into the test the information about the ordering of the compared categories (we will use the age variable, in which the age ranges are numbered 1, 2 and 3, respectively), we will be able to check if there is a trend in the compared curves. We will study the following hypotheses:
$H_0$: a lack of a trend in the survival time curves of the patients after a transplantation (a trend dependent on the age of the patients at the time of a transplantation),
$H_1$: the older the patients at the time of a transplantation, the greater/smaller the probability of their survival over a given period of time.

Report fragment (log-rank test for trend): statistic 5.113469574, degrees of freedom 1, p-value 0.023740797.

At the significance level $\alpha = 0.05$, on the basis of the obtained value $p = 0.0237$ in the log-rank test ($p = 0.0317$ for Gehan's and $p = 0.0241$ for Tarone-Ware), we conclude that the survival curves are positioned in a certain trend. On the Kaplan-Meier graph the curve for people aged 55 to 60 years is the lowest. Above that curve there is the curve for patients aged 50 to 55 years. The highest curve is the one for patients aged 45 to 50 years. Thus, the older the patient at the time of a transplantation, the lower the probability of survival over a certain period of time.

Survival curves for stratas
Let us now check if the trend observed before is independent of t
51. according to the $\chi^2$ distribution assumption.

The test statistic is defined by:
$$\chi^2 = \sum_{i=1}^{2}\sum_{j=1}^{2} \frac{(|O_{ij} - E_{ij}| - 0.5)^2}{E_{ij}}.$$

EXAMPLE 11.7 (cont.) (sex-exam.pqs file)
The p-value for the $\chi^2$ test with the Yates correction is 0.000103. As with the $\chi^2$ test without the correction, at the significance level $\alpha = 0.05$ the alternative hypothesis can be accepted. The alternative hypothesis states that there is a dependence between sex and passing the exam in the analysed population. The exam was passed significantly more often by women (55.56% out of all the women in the sample who passed the exam) than by men (25.00% out of all the men in the sample who passed the exam).

The Fisher test for 2 x 2 tables
The Fisher test for 2 x 2 tables is also called the Fisher exact test (R. A. Fisher (1934)[27], (1935)[28]). This test enables you to calculate the exact probability of the occurrence of a particular distribution of numbers in a table (knowing $n$ and the defined marginal sums):
$$P = \frac{\binom{O_{11}+O_{21}}{O_{11}} \binom{O_{12}+O_{22}}{O_{12}}}{\binom{O_{11}+O_{12}+O_{21}+O_{22}}{O_{11}+O_{12}}}.$$
If you know each marginal sum, you can calculate the probability $P$ for various configurations of observed frequencies. The exact p significance level is the sum of the probabilities which are less than or equal to the analysed probability.

The p-value is compared with the significance level $\alpha$.

The settings window with the Fisher exact test, mid-p (2x2) can be opened in Statistics menu → NonParametric tests (unordered ca
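The rule quoted above (sum the probabilities of all tables with the same marginal sums whose probability does not exceed that of the observed table) can be sketched as follows. The function names are hypothetical and exact fractions are used so that the comparison `p <= p_observed` is not affected by floating-point rounding; this is an illustration of the rule, not PQStat's implementation.

```python
from fractions import Fraction
from math import comb


def table_probability(o11, o12, o21, o22):
    """Exact probability of one 2 x 2 table given its marginal sums."""
    n = o11 + o12 + o21 + o22
    return Fraction(comb(o11 + o21, o11) * comb(o12 + o22, o12),
                    comb(n, o11 + o12))


def fisher_exact_two_sided(o11, o12, o21, o22):
    """Sum the probabilities of all tables no more likely than the observed one."""
    n = o11 + o12 + o21 + o22
    row1, col1 = o11 + o12, o11 + o21
    p_observed = table_probability(o11, o12, o21, o22)
    total = Fraction(0)
    # Enumerate every value of the top-left cell allowed by the margins.
    for a in range(max(0, row1 + col1 - n), min(row1, col1) + 1):
        b, c = row1 - a, col1 - a
        d = n - row1 - col1 + a
        p = table_probability(a, b, c, d)
        if p <= p_observed:
            total += p
    return total


print(float(fisher_exact_two_sided(3, 1, 1, 3)))
```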
52. The Z test of significance for the Cohen's Kappa ($\hat{\kappa}$) (Fleiss (1981)[30]) is used to verify the hypothesis about the agreement of the results of two measurements $X^{(1)}$ and $X^{(2)}$ of the feature $X$, and it is based on the $\hat{\kappa}$ coefficient calculated for the sample.

Basic assumptions:
- measurement on a nominal scale (alternatively: an ordinal or an interval scale).

Hypotheses:
$H_0: \kappa = 0$,
$H_1: \kappa \neq 0$.

The test statistic is defined by:
$$Z = \frac{\hat{\kappa}}{SE_{\kappa_{distr}}},$$
where
$$SE_{\kappa_{distr}} = \sqrt{\frac{P_e + P_e^2 - \sum_{i=1}^{c} p_{i.}\,p_{.i}\,(p_{i.} + p_{.i})}{(1 - P_e)^2\, n}}$$
is the standard error of the sample distribution.

The Z statistic asymptotically (for a large sample size) has the normal distribution.

The p-value, designated on the basis of the test statistic, is compared with the significance level $\alpha$:
if $p \le \alpha$, reject $H_0$ and accept $H_1$;
if $p > \alpha$, there is no reason to reject $H_0$.

The settings window with the test of Cohen's Kappa significance can be opened in Statistics menu → NonParametric tests (unordered categories) → Cohen's Kappa or in Wizard.

[Figure: settings window of the test of the Cohen's Kappa significance (data filter; report options: Add analysed data, Add graph, Add percentages, Sums)]

EXAMPLE 15.3 (diagnosis.p
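A minimal sketch of the Z test for Cohen's Kappa, assuming the usual definition $\hat{\kappa} = (P_o - P_e)/(1 - P_e)$ (the excerpt above does not show that formula, so it is an assumption here) and the null-hypothesis standard error attributed to Fleiss. The function name `cohen_kappa_z` and the sample table are hypothetical.

```python
import math


def cohen_kappa_z(table):
    """Return (kappa, Z, two-sided p-value) for a square agreement table.

    table[i][j] is the count of objects rated i in the first measurement
    and j in the second one.
    """
    c = len(table)
    n = sum(sum(row) for row in table)
    row_p = [sum(table[i]) / n for i in range(c)]
    col_p = [sum(table[i][j] for i in range(c)) / n for j in range(c)]
    p_o = sum(table[i][i] for i in range(c)) / n           # observed agreement
    p_e = sum(row_p[i] * col_p[i] for i in range(c))       # chance agreement
    kappa = (p_o - p_e) / (1 - p_e)
    s = sum(row_p[i] * col_p[i] * (row_p[i] + col_p[i]) for i in range(c))
    se = math.sqrt((p_e + p_e ** 2 - s) / (n * (1 - p_e) ** 2))
    z = kappa / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail
    return kappa, z, p_value


kappa, z, p = cohen_kappa_z([[20, 5], [10, 15]])
print(round(kappa, 4), round(z, 4), round(p, 4))
```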
53. application is 0.05 significance, opened in the window of the chosen test.

5 GRAPHS

The PQStat program offers column charts, error charts, box plots, point charts, and line-and-point charts. The window with the settings of the graph options is called up via the menu Graphs.

The basic parameters of the graph can be changed directly in the graph window. If:
- we want to change the general graph parameters, such as titles, backgrounds, axes, grid lines or the legend, we choose the tab Graph General Options;
- we want to change the appearance of the drawn object, e.g. the shape, style or colors, we choose the tab Graph Detailed Options;
- we want to draw additional elements, e.g. a line, we choose the tab Others.

The graphs presenting the results of statistical analyses are available in the window of the selected Statistical analysis, at the option Add graph.

The graph is returned to the report, where it can be:
- saved (option Save Graph as... from the context menu),
- printed (option Print Graph from the context menu),
- copied (option Copy Graph from the context menu),
- edited (this applies to the Graph General Options and Graph Detailed Options).

To edit a graph it is enough to double-click on the graph or to choose the option Edit Graph from the context menu. In the edition window it is also possible to save the graph at
54. average age of both companies' workers is similar, because another step in the experiment depends on this. The age of each participant is recorded in years.

Age (company 1): 27, 33, 25, 32, 34, 38, 31, 34, 20, 30, 30, 27, 34, 32, 33, 25, 40, 35, 29, 20, 18, 28, 26, 22, 24, 24, 25, 28, 32, 32, 33, 32, 34, 27, 34, 27, 35, 28, 35, 34, 28, 29, 38, 26, 36, 31, 25, 35, 41, 37

Age (company 2): 38, 34, 33, 27, 36, 20, 37, 40, 27, 26, 40, 44, 36, 32, 26, 34, 27, 31, 36, 36, 25, 40, 27, 30, 36, 29, 32, 41, 49, 24, 36, 38, 18, 33, 30, 28, 27, 26, 42, 34, 24, 32, 36, 30, 37, 34, 33, 30, 44, 29

The age distribution in both groups is normal (it was tested with the Lilliefors test), with the mean $\bar{x}_1 = 30.26$ and the standard deviation $sd_1 = 5.23$ for the first group, and $\bar{x}_2 = 32.68$ and $sd_2 = 6.36$ for the second group. The Fisher-Snedecor test also indicates that the variances of the age in both companies are equal (p-value = 0.176168). It means that all assumptions of the t-test for independent groups are fulfilled.

Hypotheses:
$H_0$: the mean age of the first company's workers is the same as the mean age of the second company's workers,
$H_1$: the mean age of the first company's workers differs from the mean age of the second company's workers.

[Report: Analysis time, Analysed variables, Significance level, Correction for different variances, Grouping var
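The group means quoted above can be recomputed from the listed ages, and the statistic of the t-test for independent groups can be sketched with the standard pooled-variance formula. That formula is not shown in this excerpt, so treat it as an assumption; the function name `pooled_t` is hypothetical. The p-value would come from the t-Student distribution with $n_1 + n_2 - 2$ degrees of freedom, which the Python standard library does not provide, so it is left out.

```python
import math
import statistics

company1 = [27, 33, 25, 32, 34, 38, 31, 34, 20, 30, 30, 27, 34, 32, 33, 25, 40,
            35, 29, 20, 18, 28, 26, 22, 24, 24, 25, 28, 32, 32, 33, 32, 34, 27,
            34, 27, 35, 28, 35, 34, 28, 29, 38, 26, 36, 31, 25, 35, 41, 37]
company2 = [38, 34, 33, 27, 36, 20, 37, 40, 27, 26, 40, 44, 36, 32, 26, 34, 27,
            31, 36, 36, 25, 40, 27, 30, 36, 29, 32, 41, 49, 24, 36, 38, 18, 33,
            30, 28, 27, 26, 42, 34, 24, 32, 36, 30, 37, 34, 33, 30, 44, 29]


def pooled_t(a, b):
    """Return (t statistic, degrees of freedom) assuming equal variances."""
    n1, n2 = len(a), len(b)
    m1, m2 = statistics.fmean(a), statistics.fmean(b)
    v1, v2 = statistics.variance(a), statistics.variance(b)  # n - 1 denominators
    pooled = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)
    se = math.sqrt(pooled * (1 / n1 + 1 / n2))
    return (m1 - m2) / se, n1 + n2 - 2


print(statistics.fmean(company1), statistics.fmean(company2))
print(pooled_t(company1, company2))
```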
55. between variables does not always show the cause-and-effect relationship.

14 CORRELATION
14.1 PARAMETRIC TESTS
14.1.1 THE LINEAR CORRELATION COEFFICIENTS

The Pearson product-moment correlation coefficient $r_p$, also called the Pearson's linear correlation coefficient (Pearson (1896, 1900)), is used to describe the strength of linear relations between 2 features. It may be calculated for data on an interval scale, provided that the distribution of the analysed features is normal.
$$r_p = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}},$$
where:
$x_i, y_i$ are the following values of the features $X$ and $Y$,
$\bar{x}, \bar{y}$ are the mean values of the features $X$ and $Y$,
$n$ is the sample size.

Note
$R_p$: the Pearson product-moment correlation coefficient in a population;
$r_p$: the Pearson product-moment correlation coefficient in a sample.

The value of $r_p \in \langle -1; 1 \rangle$ should be interpreted in the following way:
• $r_p \approx 1$ means a strong positive linear correlation: the measurement points are close to a straight line, and when the independent variable increases, the dependent variable increases too;
• $r_p \approx -1$ means a strong negative linear correlation: the measurement points are close to a straight line, but when the independent variable increases, the dependent variable decreases;
• if the correlation coefficient is equal to or very close to zero, there is no linear dependence between the analysed fea
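The product-moment formula above translates directly into code. The sketch below uses a hypothetical function name `pearson_r` and toy data; it only illustrates the definition and the interpretation of the extreme values.

```python
import math


def pearson_r(x, y):
    """Pearson product-moment correlation coefficient of two samples."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("the coefficient is undefined for a constant variable")
    return sxy / math.sqrt(sxx * syy)


print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))   # points on a rising line
print(pearson_r([1, 2, 3, 4], [8, 6, 4, 2]))   # points on a falling line
print(pearson_r([1, 2, 3], [1, 3, 2]))
```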
56. closed by:
- File menu → Close project,
- Close project button in the Project Manager.

To navigate the project easily you can use the Project Manager, which opens when you select the appropriate project. In this window you can both save and delete projects. You can also delete datasheets and reports, or add descriptions and notes. The Project Name is also the name of the project file (pqs/pqx).

[Figure: PQStat v. 1.4.6 main window with the project tree (Frequency tables, Descriptive statistics, Mann-Whitney, Logistic Regression, Nearest Neighbor Analysis) and the Project Manager listing the datasheets and reports with their creation dates and notes]
57. contains one free term), $d$ is the observed number of failure events (in models other than Cox's, $n$, i.e. the sample size, is used instead of $d$).

• Information criteria are based on the information entropy carried by the model (model uncertainty), i.e. they evaluate the information lost when a given model is used to describe the studied phenomenon. We should then choose the model with the minimum value of a given information criterion.

AIC, AICc and BIC are a kind of compromise between goodness of fit and complexity. The second element of the sum in the formulas for the information criteria (the so-called penalty function) measures the simplicity of the model. It depends on the number of parameters ($k$) in the model and the number of complete observations ($d$). In both cases the element grows with the increase of the number of parameters, and the growth is the faster the smaller the number of observations.

The information criterion, however, is not an absolute measure, i.e. if none of the compared models describes reality well, the information criterion will give no warning of that.

Akaike information criterion
$$AIC = -2\ln L_{FM} + 2k$$
It is an asymptotic criterion, appropriate for large sample sizes.

Corrected Akaike information criterion
$$AICc = AIC + \frac{2k(k+1)}{d - k - 1}$$
Because the correction of the Akaike information criterion concerns the sample size (the number of failure events), it is the recommended measure (also
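The two criteria can be computed directly from the model's log-likelihood. In the sketch below the function names and the numeric inputs are hypothetical; `log_likelihood` stands for $\ln L_{FM}$, `k` for the number of parameters and `d` for the number of failure events (or the sample size in models other than Cox's).

```python
def aic(log_likelihood, k):
    """Akaike information criterion: -2 ln L + 2k."""
    return -2.0 * log_likelihood + 2 * k


def aicc(log_likelihood, k, d):
    """Corrected AIC: AIC + 2k(k + 1) / (d - k - 1)."""
    if d - k - 1 <= 0:
        raise ValueError("AICc requires d > k + 1")
    return aic(log_likelihood, k) + 2 * k * (k + 1) / (d - k - 1)


# Hypothetical comparison: the smaller value indicates the preferred model.
print(aic(-100.0, 3), aicc(-100.0, 3, 50))
print(aic(-99.5, 5), aicc(-99.5, 5, 50))
```

In this made-up comparison the second model fits slightly better but uses two more parameters, so both criteria are higher for it; the penalty also grows faster when `d` is small.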
58. correlation coefficient, so the strength of the monotonic relation between age and height amounts to $r = 0.8397$.

14.2.3 The test of significance for the Kendall's tau correlation coefficient

The test of significance for the Kendall's $\tilde{\tau}$ correlation coefficient is used to verify the hypothesis about the lack of monotonic correlation between the analysed features of a population. It is based on the Kendall's tau correlation coefficient calculated for the sample. The closer to 0 the value of $\tilde{\tau}$ is, the weaker the dependence between the analysed features.

Basic assumptions:
- measurement on an ordinal scale or on an interval scale.

Hypotheses:
$H_0: \tau = 0$,
$H_1: \tau \neq 0$.

The test statistic is defined by:
$$Z = \frac{3\tilde{\tau}\sqrt{n(n-1)}}{\sqrt{2(2n+5)}}.$$
The test statistic asymptotically (for a large sample size) has the normal distribution.

The p-value, designated on the basis of the test statistic, is compared with the significance level $\alpha$:
if $p \le \alpha$, reject $H_0$ and accept $H_1$;
if $p > \alpha$, there is no reason to reject $H_0$.

The settings window with the Kendall's monotonic correlation can be opened in Statistics menu → NonParametric tests (ordered categories) → monotonic correlation (tau-Kendall) or in Wizard.

[Figure: settings window of the Kendall's monotonic correlation (Variable 1, Variable 2, data filter)]
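Given a Kendall's tau already calculated for a sample, the Z statistic above and its two-sided p-value from the normal distribution can be sketched as follows. The function name `kendall_z` and the example inputs are hypothetical.

```python
import math


def kendall_z(tau, n):
    """Return (Z, two-sided p-value) for Kendall's tau from a sample of size n."""
    if n < 2:
        raise ValueError("n must be at least 2")
    z = 3 * tau * math.sqrt(n * (n - 1)) / math.sqrt(2 * (2 * n + 5))
    p_value = math.erfc(abs(z) / math.sqrt(2))  # asymptotic normal approximation
    return z, p_value


z, p = kendall_z(0.5, 10)
print(round(z, 4), round(p, 4))
```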
59. cost, the sum of discounts made, and the author's popularity. However, not all of those variables need to have a significant effect on profit. Let us try to select a model of linear regression which will contain the optimum number of variables (from the perspective of statistics).

• Manual model comparison
On the basis of the earlier constructed full model we can suspect that the variables direct promotion costs and the sum of discounts made have a small influence on the constructed model (i.e. those variables do not help predict the size of the profit). We will check if, from the perspective of statistics, the full model is better than the model from which the two variables have been removed.

[Report: Analysis time; Analysed variables (gross_profit, prod_c, advert_c, ...); Significance level 0.05; for model 1 (5 variables) and model 2: Analysed variables, Standard error of estimation, R, R2, Adjusted R2; F (comparing models), DF1, DF2, p-value]

              b coeff      b error     -95% CI      +95% CI
intercept     4.175166     4.772779    -5.544268    13.874639
prod_c        2.560709     0.501507     1.541525     3.579894
advert_c      1.998235     0.359065     1.268527     2.727943
prom_c        4.568238     4.791644    -5.069555    14.40603
rebates       1.423171     1.404811    -1.431749     4.278091
popular_at   10.153717     2.762567     4.498861    15.808574

8.086501
60. criterion, then sorting is performed according to the column (variable) sequence placed in the Sequence box.

3.1.7 HOW TO CONVERT RAW DATA INTO CONTINGENCY TABLE

You can start the operation of converting raw data into a contingency table by selecting Create table... from the Data menu.

By default, the whole datasheet is available for this operation. However, if you start the conversion by selecting a piece of data, you will be able to limit the available area to the selection only.

A contingency table is designed by selecting the variables forming the row and column labels. If the preview of the table looks as expected, you confirm the choice by selecting Run. The returned result will be placed in a new datasheet.

[Figure: Convert into contingency table window (only in the selected area; Variable for column labels; Variable for row labels)]

3.1.8 HOW TO CONVERT CONTINGENCY TABLE INTO RAW DATA

You can start the operation of converting a contingency table into raw data by selecting Create raw data... from the Data menu.

In the window of data transformation we enter the appropriate numbers and the headers of rows and columns. You confirm the choice by selecting Run. The returned result will be placed in a new datasheet with headers.

If we convert a table which is placed in a datasheet, we have to select it wi
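The two conversions described above (raw rows into a contingency table and back) are simple counting operations. The sketch below is an illustration outside PQStat: the function names and the sample rows are hypothetical, and labels are sorted alphabetically, which is an arbitrary choice.

```python
from collections import Counter


def to_contingency(rows):
    """Count (row_label, column_label) pairs into a contingency table."""
    counts = Counter(rows)
    row_labels = sorted({r for r, _ in rows})
    col_labels = sorted({c for _, c in rows})
    table = [[counts[(r, c)] for c in col_labels] for r in row_labels]
    return row_labels, col_labels, table


def to_raw(row_labels, col_labels, table):
    """Expand a contingency table back into one raw row per counted object."""
    raw = []
    for r, counts in zip(row_labels, table):
        for c, k in zip(col_labels, counts):
            raw.extend([(r, c)] * k)
    return raw


raw = [("f", "primary"), ("m", "primary"), ("f", "higher"), ("f", "primary")]
rows, cols, table = to_contingency(raw)
print(rows, cols, table)
print(to_raw(rows, cols, table))
```

The round trip preserves the counts but not the original order of the raw rows.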
61. degrees of freedom,
$df_{BS} = n - 1$ are the between-subjects degrees of freedom,
$N = nk$,
$n$ is the sample size,
$x_{ij}$ are the values of the variable from $i$ subjects ($i = 1, 2, \ldots, n$) in $j$ conditions ($j = 1, 2, \ldots, k$).

The test statistic has the F-Snedecor distribution with $df_{BC}$ and $df_{res}$ degrees of freedom.

The p-value, designated on the basis of the test statistic, is compared with the significance level $\alpha$:
if $p \le \alpha$, reject $H_0$ and accept $H_1$;
if $p > \alpha$, there is no reason to reject $H_0$.

The POST-HOC tests
An introduction to the contrasts and the POST-HOC tests was given in unit 12.1.2, which relates to the one-way analysis of variance.

The LSD Fisher test
For simple and complex comparisons (the frequency in particular measurements is always the same).

Hypotheses
Example: simple comparisons (comparison of 2 selected means):
$H_0: \mu_j = \mu_{j+1}$,
$H_1: \mu_j \neq \mu_{j+1}$.

(i) The value of the critical difference is calculated by using the following formula:
$$CD = \sqrt{F_{\alpha,1,df_{res}}} \cdot \sqrt{\left(\sum_{j=1}^{k} \frac{c_j^2}{n}\right) MS_{res}},$$
where $F_{\alpha,1,df_{res}}$ is the critical value (statistic) of the F-Snedecor distribution for a given significance level $\alpha$ and degrees of freedom equal to 1 and $df_{res}$, respectively.

(ii) The test statistic is defined by:
$$t = \frac{\sum_{j=1}^{k} c_j \bar{x}_j}{\sqrt{\left(\sum_{j=1}^{k} \frac{c_j^2}{n}\right) MS_{res}}}.$$
The test statistic has the t-Student distribution with $df_{res}$ degrees of freedom.

The Scheffe test
For simple comparison
62. does change.

Education                OR [95% CI]
Primary education        reference category
Vocational education     0.51 [0.26; 0.99]
Secondary education      0.42 [0.22; 0.80]
Tertiary education       0.45 [0.22; 0.92]

The odds of the occurrence of the studied anomaly in each education category are always compared with the odds of the occurrence of the anomaly in the case of primary education. We can see that for more educated mothers the odds are lower. For a mother with:
• vocational education, the odds of the occurrence of the anomaly in a child are 0.51 of the odds for a mother with primary education;
• secondary education, the odds of the occurrence of the anomaly in a child are 0.42 of the odds for a mother with primary education;
• tertiary education, the odds of the occurrence of the anomaly in a child are 0.45 of the odds for a mother with primary education.

EXAMPLE 17.3 (task.pqs file)
An experiment has been made with the purpose of studying the ability to concentrate of a group of adults in an uncomfortable situation. 130 people have taken part in the experiment. Each person was assigned a certain task, the completion of which required concentration. During the experiment some people were subject to a disturbing agent in the form of a temperature increase to 32 degrees Celsius. The participants were also asked about their address of residence, sex, age, and education. The t
63. 19.3 COMPARISON OF SURVIVAL CURVES
19.3.1 Differences among the survival curves
19.3.2 Survival curve trend
19.3.3 Survival curves for the stratas
19.4 PROPORTIONAL COX HAZARD REGRESSION
19.4.1 Hazard ratio
19.4.2 Model verification
19.4.3 Analysis of model residuals
19.5 COMPARISON OF COX PH REGRESSION MODELS
20 RELIABILITY ANALYSIS
21 THE WIZARD
22 OTHER NOTES

1 SYSTEM REQUIREMENTS

To use PQStat, your computer must meet the following minimum requirements:
- Processor: Intel Pentium II 500 MHz or better,
- 256 MB RAM or greater,
- SVGA (800 x 600), 16-bit colour or better,
- 200 MB of disc space,
- CD-ROM (required only for the alternate install from a CD),
- Other requirements: a keyboard, a mouse.

Supported Operating Systems: Windows 2000/XP/Vista/7/8

2 HOW TO INSTALL

To start the installation process, run the application installer PQStat setup_x86 FULL (for 64 bit version: PQSt
64. • percentage: changes the number into a percentage by multiplying it by 100 and displaying it with the % symbol; as in the case of the numerical format, it is possible to set the number of decimals,
• currency: used for money values, it allows you to add the symbol of a currency; as in the case of the numerical format, it is possible to set the number of decimals,
• range: marked with the use of the upper and lower boundary; as in the case of the numerical format, it is possible to set the number of decimals,
• formula: values calculated according to the formula ascribed to the column; the values are automatically recalculated when any of the entry data is changed.

When a new sheet is opened, there is a standard default format for each cell. In the default format the sheet handles cell content automatically. The whole header row is permanently set to the text format. You can set defined formats for the rest of the sheet. Only a whole column (except for its header) can be formatted, not a single cell.

To set a column format you should select:
- Format... in the context menu of the number displayed above a column header,
- Edit → Column format..., when the active cell identifies the proper column.

[Figure: Column Format window (Format variables; Decimal places; default format automatically detected on the basis of cell contents)]

You can define the width of a column by using the mouse pointer. In order to do this, you should move the line wh
65. exam. Below there are some data in the form of raw rows, and all the data in the form of a contingency table. Check if both surveys give similar results.

[Table: raw rows with the variables Study 1 and Study 2 (values: have no opinion, positive, negative)]
[Table: contingency table of Study 1 by Study 2 opinions (negative, positive, have no opinion)]

Hypotheses:
$H_0$: the number of students who changed their opinions is exactly the same for each of the possible symmetric opinion changes,
$H_1$: the number of students who changed their opinions is different for at least one of the possible symmetric opinion changes,
where, for example, changing the opinion from positive to negative is symmetrical to changing the opinion from negative to positive.

[Report: Analysis time; Analysed variables: Study 1, Study 2; Significance level; Continuity correction; Size (number of pairs); Chi-square statistic; Degrees of freedom; p-value < 0.000001]
[Figure: bar chart comparing the symmetric opinion changes (have no opinion <> negative, have no opinion <> positive, negative <> positive), below and above the main diagonal]

Comparing the p-value for the Bowker test (p-value < 0.000001) wit
66. for testing for bacteremia. The most useful and reliable parameters for screening and monitoring bacterial infections are the following indicators:
- WBC: the number of white blood cells,
- PCT: procalcitonin.
It is assumed that in a healthy infant or a small child WBC should not exceed 15 thousand/µl, and PCT should be lower than 0.5 ng/ml.

The sample values of those indicators for 136 children of up to 3 years old with persistent fever > 39 °C are presented in the table fragment below:

WBC      PCT      bacteremia   Sex
…        0.023    no           F
…        0.022    no           F
…        0.005    no           F
…        0.004    no           F
6.10     0.006    no           F
12.50    0.031    no           F
4.90     0.002    no           F
6.90     0.011    no           F
11.60    0.025    no           F
20.50    5.919    yes          F
20.80    6.405    yes          F

One method of analyzing the PCT indicator is transforming it into a dichotomous variable by selecting a cut-off (e.g. at 0.5 ng/ml) above which the study is considered to be positive. The adequacy of such a division will be indicated by the values of sensitivity and specificity. We want to use a more complex approach, that is, calculate the sensitivity and specificity not only for one value but for each PCT value obtained in the sample, which means constructing a ROC curve. On the basis of the information obtained in that manner we want to check if the PCT indicator is indeed useful for diagnosing bacteremia. If so, then
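The approach described above, computing sensitivity and specificity for every observed PCT value rather than for a single cut-off, can be sketched as follows. The function names are hypothetical, the data are only the eleven PCT rows legible in the table fragment (not the full 136-child sample), and treating a value equal to the cut-off as positive is an assumption.

```python
def sens_spec(values, diseased, cutoff):
    """Sensitivity and specificity when 'value >= cutoff' counts as positive."""
    tp = sum(1 for v, d in zip(values, diseased) if d and v >= cutoff)
    tn = sum(1 for v, d in zip(values, diseased) if not d and v < cutoff)
    positives = sum(diseased)
    negatives = len(diseased) - positives
    return tp / positives, tn / negatives


def roc_points(values, diseased):
    """(cutoff, sensitivity, specificity) for each distinct observed value."""
    return [(c, *sens_spec(values, diseased, c)) for c in sorted(set(values))]


pct = [0.023, 0.022, 0.005, 0.004, 0.006, 0.031, 0.002, 0.011, 0.025,
       5.919, 6.405]
bacteremia = [False] * 9 + [True] * 2

print(sens_spec(pct, bacteremia, 0.5))
for cutoff, sens, spec in roc_points(pct, bacteremia):
    # A ROC curve plots sensitivity against 1 - specificity.
    print(cutoff, sens, 1 - spec)
```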
67. for tertiary level education, the statistically significant coefficient $b = 1.5$, which means that the average wages of people with tertiary level education are 1.5 PLN higher than those of people with elementary education, assuming that all other variables in the model remain unchanged.

Effect coding is used to answer, with the use of multidimensional models, the question: How do the $Y$ results in each analyzed category differ from the results of the (unweighted) mean obtained from the sample? The coding consists in ascribing the value -1 or 1 to each category of the given variable. The category coded as -1 is then the base category.

$k = 2$
If the coded variable is dichotomous, then by placing it in a regression model we will obtain the coefficient calculated for it ($b_i$). The coefficient is the reference of $Y$ for category 1 to the unweighted general mean (corrected with the remaining variables in the model).

If the analyzed variable has more than two categories, then $k$ categories are represented by $k - 1$ dummy variables with effect coding. When creating variables with effect coding, a category is selected for which no separate variable is made. That category is treated in the models as the base category (as in each variable made by effect coding it has the value -1).

When the $X_1, X_2, \ldots, X_{k-1}$ variables obtained in that way, with effect coding, are placed in a regre
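A small sketch of effect coding as described above: $k$ categories become $k - 1$ variables, and the base category gets the value -1 in each of them. The use of 0 for "a different, non-base category" follows the standard effect-coding scheme and is an assumption here, since the excerpt only mentions -1 and 1; the function name and category labels are hypothetical.

```python
def effect_code(values, base):
    """Return (names of coded variables, list of coded rows).

    Each non-base category gets its own variable: 1 for that category,
    -1 for the base category, 0 for any other category.
    """
    categories = sorted(set(values))
    if base not in categories:
        raise ValueError("base category does not occur in the data")
    others = [c for c in categories if c != base]
    coded = []
    for v in values:
        if v == base:
            coded.append([-1] * len(others))
        else:
            coded.append([1 if v == c else 0 for c in others])
    return others, coded


names, rows = effect_code(
    ["elementary", "tertiary", "secondary", "elementary"], base="elementary")
print(names)
print(rows)
```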
68. for the difference between proportions close to those extreme values, the Wald method can lead to unreliable results (Newcombe 1998[65], Miettinen 1985[64], Beal 1987[7], Wallenstein 1997[79]). A comparison and analysis of many methods which can be used instead of the simple Wald method can be found in Newcombe's study (1998)[65]. The suggested method, suitable also for extreme values of proportions, is the method first published by Wilson (1927)[86], extended to the intervals for the difference between two independent proportions.

Note
The confidence interval for the NNT is estimated on the basis of the Newcombe-Wilson method (Bender (2001)[8], Newcombe (1998)[65], Wilson (1927)[86]).

The settings window with the Z test for 2 proportions can be opened in Statistics menu → NonParametric tests (ordered categories) → Z for 2 independent proportions.

[Figure: settings window of the Z test for two independent proportions (Frequency (numerator) and Sample size (denominator) for each group; data filter; report options: Add analysed data, Add graph)]
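The interval for the difference between two independent proportions built from Wilson's intervals can be sketched as below. The combination rule used here is the commonly cited Newcombe hybrid score method; the excerpt does not print the formula, so this is an assumption about what the Newcombe-Wilson method refers to, and the function names and example counts are hypothetical.

```python
import math
from statistics import NormalDist


def wilson_interval(x, n, conf=0.95):
    """Wilson score interval for a single proportion x / n."""
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    p = x / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


def newcombe_difference(x1, n1, x2, n2, conf=0.95):
    """Difference p1 - p2 with an interval combining two Wilson intervals."""
    p1, p2 = x1 / n1, x2 / n2
    l1, u1 = wilson_interval(x1, n1, conf)
    l2, u2 = wilson_interval(x2, n2, conf)
    d = p1 - p2
    lower = d - math.sqrt((p1 - l1) ** 2 + (u2 - p2) ** 2)
    upper = d + math.sqrt((u1 - p1) ** 2 + (p2 - l2) ** 2)
    return d, lower, upper


print(newcombe_difference(56, 70, 48, 80))
```

Unlike the Wald interval, the limits stay inside the range from -1 to 1 even when a proportion is close to 0 or 1.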
69. frequencies can be created an adequate table of expected frequencies (table 11.2).

Table 11.2. The contingency table of r x c expected frequencies

Expected frequencies $E_{ij}$    Feature Y
Feature X    $Y_1$       $Y_2$       ...    $Y_c$
$X_1$        $E_{11}$    $E_{12}$    ...    $E_{1c}$
$X_2$        $E_{21}$    $E_{22}$    ...    $E_{2c}$
...          ...         ...         ...    ...
$X_r$        $E_{r1}$    $E_{r2}$    ...    $E_{rc}$

where
$$E_{ij} = \frac{\left(\sum_{j=1}^{c} O_{ij}\right) \times \left(\sum_{i=1}^{r} O_{ij}\right)}{n},$$
for example $E_{11} = \frac{\sum_{j=1}^{c} O_{1j} \times \sum_{i=1}^{r} O_{i1}}{n}$.

For the data from the example (11.4) the contingency table of expected frequencies looks like this:

          higher    primary    secondary
female    5.74      5.29       3.97
male      7.26      6.71       5.03

• The contingency table of percentages calculated from the sum of columns.
For the data from the example (11.4) the contingency table looks like this:

          higher    primary    secondary
female    53.85%    33.33%     44.44%
male      46.15%    66.67%     55.56%

• The contingency table of percentages calculated from the sum of rows.
For the data from the example (11.4) the contingency table looks like this:

          higher    primary    secondary
female    46.67%    26.67%     26.67%
male      31.58%    42.11%     26.32%

• The contingency table of the percentages calculated from the sum of rows and columns (from total).
For the data from the example (11.4) the table look
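The derived tables above (expected frequencies, column percentages, row percentages) all follow from the observed counts. In the sketch below the function names are hypothetical, and the observed counts (female: 7, 4, 4; male: 6, 8, 5) are not printed in this excerpt: they are inferred from the quoted percentages, so treat them as an assumption.

```python
def expected_frequencies(observed):
    """E_ij = (row i sum) * (column j sum) / n."""
    n = sum(map(sum, observed))
    row_sums = [sum(row) for row in observed]
    col_sums = [sum(col) for col in zip(*observed)]
    return [[rs * cs / n for cs in col_sums] for rs in row_sums]


def row_percentages(observed):
    """Each cell as a percentage of its row sum."""
    return [[100 * v / sum(row) for v in row] for row in observed]


def column_percentages(observed):
    """Each cell as a percentage of its column sum."""
    col_sums = [sum(col) for col in zip(*observed)]
    return [[100 * v / cs for v, cs in zip(row, col_sums)] for row in observed]


# Columns: higher, primary, secondary; rows: female, male (inferred counts).
observed = [[7, 4, 4], [6, 8, 5]]
for name, table in [("expected", expected_frequencies(observed)),
                    ("column %", column_percentages(observed)),
                    ("row %", row_percentages(observed))]:
    print(name, [[round(v, 2) for v in row] for row in table])
```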
70. from 2 independent populations
14.1.5 The test for checking the equality of the coefficients of linear regression equation, which come from 2 independent populations
14.2 NONPARAMETRIC TESTS
14.2.1 THE MONOTONIC CORRELATION COEFFICIENTS
14.2.2 The test of significance for the Spearman's rank-order correlation coefficient
14.2.3 The test of significance for the Kendall's tau correlation coefficient
14.2.4 CONTINGENCY TABLES COEFFICIENTS AND THEIR STATISTICAL SIGNIFICANCE
15 AGREEMENT ANALYSIS
15.1 PARAMETRIC TESTS
15.1.1 The intraclass correlation coefficient and the test of its significance
15.2 NONPARAMETRIC TESTS
15.2.1 The Kendall's coefficient of concordance and the test of its significance
15.2.2 The Cohen's Kappa coefficient and the test of its significance
16 DIAGNOSTIC TESTS
16.1 EVALUATION OF DIAGNOSTIC TEST
16.2 ROC CURVE
16.2.1 Selection of optimum cut-off
16.2.2 ROC curves comparison
17 MULTIDIMENSIONAL MODELS
17.1 PREPARATION OF THE VARIABLE
71. growing by 1 unit.

The OR value is interpreted as follows:
• $OR > 1$ means the stimulating influence of the studied independent variable on obtaining the distinguished value (1), i.e. it gives information about how much greater the odds of the occurrence of the distinguished value (1) are when the independent variable grows by 1 unit;
• $OR < 1$ means the destimulating influence of the studied independent variable on obtaining the distinguished value (1), i.e. it gives information about how much lower the odds of the occurrence of the distinguished value (1) are when the independent variable grows by 1 unit;
• $OR \approx 1$ means that the studied independent variable has no influence on obtaining the distinguished value (1).

Odds Ratio: the general formula
The PQStat program calculates the individual Odds Ratio. Its modification on the basis of a general formula makes it possible to change the interpretation of the obtained result.

The Odds Ratio for the occurrence of the distinguished state in a general case is calculated as the ratio of two odds. Therefore, for the independent variable $X_1$, for $Z$ expressed with a linear relationship, we calculate:

the odds for the first category:
$$Odds(1) = \frac{P(1)}{1 - P(1)} = e^{Z(1)} = e^{\beta_0 + \beta_1 X_1(1) + \beta_2 X_2 + \ldots + \beta_k X_k},$$
the odds for the second category:
$$Odds(2) = \frac{P(2)}{1 - P(2)} = e^{Z(2)} = e^{\beta_0 + \beta_1 X_1(2) + \beta_2 X_2 + \ldots + \beta_k X_k}.$$
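The ratio of the two odds above can be sketched numerically. The coefficients and variable values in the example are made up, and the function names are hypothetical; the point is that the ratio depends only on how the variables differ between the two compared states.

```python
import math


def odds_from_probability(p):
    """Odds = P / (1 - P)."""
    return p / (1 - p)


def odds_from_model(betas, xs):
    """Odds = exp(beta_0 + beta_1 * x_1 + ... + beta_k * x_k)."""
    z = betas[0] + sum(b * x for b, x in zip(betas[1:], xs))
    return math.exp(z)


def odds_ratio(betas, xs_first, xs_second):
    """Ratio of the odds for two sets of values of the independent variables."""
    return odds_from_model(betas, xs_first) / odds_from_model(betas, xs_second)


betas = [-1.0, 0.5, 0.2]                    # hypothetical beta_0, beta_1, beta_2
print(odds_ratio(betas, [1, 0], [0, 0]))    # X1 grows by 1 unit: exp(0.5)
print(odds_ratio(betas, [3, 2], [1, 2]))    # X1 grows by 2 units: exp(1.0)
```

The intercept and the variables that are equal in both states cancel out, which is why a change of one unit in a single variable gives $e^{\beta}$ for that variable.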
72. import any formatting and formulas.

Copying data with relation
Data from one datasheet can be copied to another selected datasheet on the basis of a relation. That kind of copying is done by selecting from the menu Data → Copying with relation...

[Figure: Create a relationship for copying data window (Source sheet (from), Destination sheet (to), Related Variables, Select variables that you want to import, Select destination: Insert after)]

In order to build a relationship, one ought to select the datasheet from which the copying is to be done and the datasheet into which the copied data will be transferred. Both datasheets ought to have the same key, i.e. the variable the values of which identify each row in the datasheet. The key for the source datasheet must be unique. The principle of the design is a one-to-many relationship, i.e. one row from the source datasheet can be related to many rows from the destination datasheet. The keys of both datasheets ought to be selected as Related variables.

Having set the relationship as described above, we select the variables to be copied and the column after which the copied variables are to be placed.

3.1.3 DATASHEET WINDOW

Rows and columns of a datasheet are marked with successive natural numbers. You can give your own header to each column in a place where grey colour occu
73. in this application is based on projects. Each project is a separate file. A project is an object similar in meaning to a worksheet, and it consists of 3 basic elements:
1. Datasheets (including map sheets and matrices); the number of sheets in a given project is limited to 255.
2. Results sheets (reports); the number of reports in a given datasheet is limited to 1024.
3. Project manager; it enables you to change the names of datasheets and results, add your own descriptions and notes, and export.

It is possible to work on 255 opened projects at the same time. The first one, together with an empty sheet, is created automatically right after the application is launched, if the appropriate option in the application settings is selected. Further projects can be created by:
- File menu → New project (Ctrl+N),
- the button on the toolbar.

Created projects (files with the pqs/pqx extension) can be opened by:
- File → Open project (Ctrl+O),
- the button on the toolbar,
- File → Open recent,
- File → Open examples (it applies to the examples attached to the application),
- dragging the project file into the application window,
- double-clicking the project file.

The project can be saved by:
- File menu → Save (Ctrl+S),
- File → Save as...,
- Save button in the Project Manager,
- the button on the toolbar.

Saving the project causes all project elements to be saved in a file with the pqs or pqx extension.

The project can be
74. know from various sources that sex can influence the survival function in leukemia, in that the survival functions can be distributed disproportionately with respect to each other along the time line. That is why we create the Cox model for three variables: Sex, Rx and log WBC. Before interpreting the coefficients of the model we will check the Schoenfeld residuals. We will present them in graphs, and their results, together with time, will be copied from the report to a new datasheet, where we will check for Spearman's monotonic correlation. The obtained values are p = 0.0259 for the correlation of time and Schoenfeld residuals for sex, p = 0.6192 for log WBC, and p = 0.1490 for Rx, which confirms that the assumption of hazard proportionality has not been fulfilled by the sex variable. Therefore we will build the Cox models separately for women and men. For that purpose we will run the analysis twice with the data filter switched on: first the filter will point to the female sex (0), second to the male sex (1).
For women:
Variable | B coeff | B error | -95% CI | +95% CI | Wald stat | p-value | Hazard ratio | -95% CI | +95% CI
log WBC | 1.1701250 | 0.4985684 | 0.1929488 | 2.1473011 | 5.5082668 | 0.0189267 | 3.2223954 | 1.2128207 | 8.5617207
Rx | 0.2667231 | 0.5659162 | -0.842452 | 1.3756985 | 0.2221350 | 0.6374179 | 1.3056788 | 0.43065351 | 3.95663235
For men:
75. latter. In order to avoid over-parametrization in a model in which there are interactions of dichotomous variables, it is recommended to choose the option effect coding. 17.2 MULTIPLE LINEAR REGRESSION. The window with settings for Multiple Regression is accessed via the menu Statistics → Multidimensional Models → Multiple Regression. [Multiple Regression settings window: variable list; Data Filter; Report options: Add analysed data, Mean and std dev, Add graph, Correlation/covariance matrix, Partial correlation and redundancy, Residual analysis.] The constructed model of linear regression allows the study of the influence of many independent variables ($X_1, X_2, \ldots, X_k$) on one dependent variable ($Y$). The most frequently used variety of multiple regression is Multiple Linear Regression. It is an extension of linear regression models based on Pearson's linear correlation coefficient. It presumes the existence of a linear relation between the studied variables. The linear model of multiple regression has the form $Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \ldots + \beta_k X_k$, where $Y$: dependent variable, explained by the model; $X_1, X_2$
76. level. Convergence has been reached. Report: Size; Number of estimated parameters; Frequency 0 (control); Frequency 1 (case). Likelihood ratio test: Log Likelihood = -418.000866; -2 Log Likelihood = 836.001732; Log Likelihood (intercept) = -469.8092355; -2 Log Likelihood (intercept) = 939.618471; Chi-square statistic = 103.616739; Degrees of freedom = 9; p-value < 0.000001. Pseudo R2 = 0.110275; R2 (Nagelkerke) = 0.166989; R2 (Cox-Snell) = 0.141722. Hosmer-Lemeshow test: Chi-square statistic; Degrees of freedom; p-value. Coefficient table, intercept row: b = 1.473902; b error = 0.664907; 95% CI from 0.170709 to 2.777096; Wald stat = 4.913783; p-value = 0.026643; odds ratio = 4.366241; 95% CI for the odds ratio from 1.186146 to 16.072277. Remaining rows (Address, fRes, BirthWeight, MAge, FregNo, SponAbort, RespTint, Smoking, MEdu), values as printed: 0.040877 0.464687 0.307868 0.033758 0.293138 0.433693 1.495785 1.490982 0.183437 0.171507 0.170064 0.131076 0.018671 0.100444 0.303193 0.277773 0.411668 0.101185 0.377024 0.131368 0.564772 0.070353 0.096271 1.02794 0.95136 0.663736 0.381755 0.29527 0.798005 0.050963 0.002837 0.490005 0.160554 2.040209 2.290227 0.014881 0.056807 7.465616 3.516717 3.268947 8.517193 2.046102 28.99738 13.104768 3.266588 0.611616 0.006287 0.018836 0.070603 0.003518 0.152596 0.000001 0.000295 0.069848 0.959947 1.591515 0.735013 0.966805 1.340628 0.648111 4.462837 4.441453 0.832404 0.685899 1.140387 0.56849 0
77. made with the assumption that each of the five characteristics enumerated by the client is equally important, but one can also point to the characteristics which should have a greater influence on the result of the analysis. We will build two matrices of Euclidean distances: 1. In the first matrix there will be Euclidean distances calculated on the basis of the five characteristics, treated equally. 2. In the second matrix there will be Euclidean distances in whose construction the number of rooms and the distance to the district center play the most important role. In order to build the first matrix, in the matrix window we select the 5 normalized variables (marked as Norm), the Euclidean metric, and the Flat variable as the identifier of the object. [Similarity matrix window: Options; Select the source; Weights matrix; Map (According to distance, According to contiguity); ID; Metric (Euclidean, Mahalanobis, CityBlock, Chebyshev, Cosine, Bray-Curtis); Neighborhood 0/1; variable list including Flat, Retail property, In district A, In a low block of flats, Not renovated, Floor on which the flat is located, Age of the building, Distance of the district center, and the Norm variables.]
78. of items; Mean of scale; Standard deviation of scale; Cronbach Alpha for scale; Standard error of measurement; Average correlation between pairs of items; Standardized Cronbach alpha. [Graph: mean, 95% CI and standard deviation for the items KK1-KK7.] A more precise analysis of each item indicates that, except for the last one, they all influence scale reliability in a similar way. The correlation between the KK7 item and the other scale items is the weakest (0.026954). If the KK7 item were removed from the scale, the Cronbach alpha coefficient would increase to 0.803619. A similar conclusion can be drawn on the basis of the split-half reliability analysis carried out on the items randomly divided into 2 halves: KK1, KK3, KK5 and KK2, KK4, KK6, KK7. 20 RELIABILITY ANALYSIS. Split-half report: Analysed variables: KK1, KK3, KK5, KK2, KK4, KK6, KK7; Significance level: 0.05; Group size: 24; Mean of scale: 26.083333; Standard deviation of scale: 5.96305; Correlation between two halves of scale: 0.750862; Split-half reliability: 0.857705; Standard error of measurement: 2.24936; Guttman split-half reliability: 0.856531. First half: Number of items: 3; Names of items: KK1, KK3, KK5; Mean: 11.625; Standard deviation: 3.076029; Cronbach Alpha: 0.607122. Second half: Number of items: 4; Names of items: KK2, KK4, KK6, KK7; Mean: 14.458333; Standard deviation: 3.296628; Cronbach Alpha: 0.416
79. one observer twice (recurrence). The $\hat{\kappa}$ coefficient is calculated for categorical dependent variables and its value lies in the range from -1 to 1. A value of 1 means full agreement; a value of 0 means agreement at the level which would occur for data spread randomly in a contingency table. The level between 0 and -1 is practically not used. A negative $\hat{\kappa}$ value means agreement at a level lower than the agreement which would occur for randomly spread data in a contingency table. 15 AGREEMENT ANALYSIS. The $\hat{\kappa}$ coefficient can be calculated on the basis of raw data or a c × c contingency table. To calculate the $\hat{\kappa}$ coefficient, you need to transform the contingency table of observed frequencies $O_{ij}$ (11.6) into the contingency table of probabilities $p_{ij}$ (15.1). Table 15.1: The c × c contingency table of probabilities. $\hat{\kappa} = \frac{P_o - P_e}{1 - P_e}$, where $P_o = \sum_{i=1}^{c} p_{ii}$ and $P_e = \sum_{i=1}^{c} p_{i.}\,p_{.i}$, or equivalently $\hat{\kappa} = \frac{\sum_{i} O_{ii} - \sum_{i} E_{ii}}{n - \sum_{i} E_{ii}}$, where $O_{ii}$, $E_{ii}$ are the observed and the expected frequencies of the main diagonal. Note: $\hat{\kappa}$ is the coefficient of agreement in a sample; $\kappa$ is the coefficient of agreement in a population. The standard error of Kappa (Hanley 1987 [38]) is defined by $SE_{\hat{\kappa}} = \frac{\sqrt{A + B - C}}{(1 - P_e)\sqrt{n}}$, where $A = \sum_{i} p_{ii}\left[1 - (p_{i.} + p_{.i})(1 - \hat{\kappa})\right]^2$, $B = (1 - \hat{\kappa})^2 \sum_{i \neq j} p_{ij}(p_{.i} + p_{j.})^2$, C R P 1
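As a worked illustration of the kappa definition above, the following minimal sketch computes $\hat{\kappa}$ from a table of observed frequencies. The 2 × 2 table is hypothetical and `cohen_kappa` is an illustrative helper name, not a PQStat function.

```python
def cohen_kappa(table):
    """Kappa from a c x c table of observed frequencies: (P_o - P_e) / (1 - P_e)."""
    c = len(table)
    n = sum(sum(row) for row in table)
    # P_o: share of observations on the main diagonal (observed agreement).
    p_o = sum(table[i][i] for i in range(c)) / n
    # Marginal probabilities of rows and columns.
    row_p = [sum(table[i]) / n for i in range(c)]
    col_p = [sum(table[i][j] for i in range(c)) / n for j in range(c)]
    # P_e: agreement expected when the data are spread randomly.
    p_e = sum(row_p[i] * col_p[i] for i in range(c))
    return (p_o - p_e) / (1 - p_e)


# Hypothetical ratings of 50 objects by two observers.
observed = [[20, 5],
            [10, 15]]
kappa = cohen_kappa(observed)
print(round(kappa, 4))
```

For this table $P_o = 0.7$ and $P_e = 0.5$, so the coefficient is 0.4, i.e. agreement noticeably above the chance level but far from full agreement.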
80. pal component analysis or discriminant analysis. Those methods allow the detection of relationships among the variables. On the basis of those relationships one can distinguish, for further analysis, groups of similar variables and select only one representative variable of each group, or a new variable whose values are calculated on the basis of the remaining variables in the group. As a result, one can be certain that the information carried by each group is included in the analysis. In this manner we can reduce a set of p variables to a set of k variables, where k < p, with only a small loss of information. 18.1 PRINCIPAL COMPONENT ANALYSIS. The window with settings for Principal component analysis is accessed via the menu Statistics → Multivariate Models → Principal Component Analysis. [Principal Component Analysis settings window: variable list (Sepal Length, Petal Length, Petal Width, ...); Test options: Eigenvalues, On the basis of a covariance matrix, Factor loadings, Communalities, Variable contributions, Bartlett test, Kaiser-Meyer-Olkin coefficient; Data Filter; Report options: Add analysed data, Add Principal Components, Add graph.] Principal component analysis involves defining complet
81. patient at the time of the transplantation was in the range <45 years; 60 years). A fragment of the collected data is presented in the table below:
14 | dead | hospital 1 | <45; 50) | 1
21 | alive | hospital 1 | <45; 50) | 1
4 | dead | hospital 1 | <50; 55) | 2
4 | alive | hospital 1 | <50; 55) | 2
5 | dead | hospital 1 | <50; 55) | 2
6 | alive | hospital 1 | <50; 55) | 2
6 | alive | hospital 1 | <50; 55) | 2
9 | dead | hospital 1 | <50; 55) | 2
16 | alive | hospital 1 | <50; 55) | 2
The complete data in the analysis are those for which we have complete information about the length of life after the transplantation, i.e. described as death (this concerns 53 people, which constitutes 59.55% of the sample). The censored data are those for which we do not have that information because the patients were alive when the study was finished (36 people, i.e. 40.45% of them). We build the life tables of those patients by creating time periods of 3 years. 19 SURVIVAL ANALYSIS. Life table (column labels as printed: Interval, Censored, Failure events, Entered, At risk, Failure events, Censored, Cumulative, Probability, Hazard rate, Std error, Std error, Std error):
0-3: 0; 89; 89; 0.0561797; 0.9438202; 1; 0.0187265; 0.0192678; 0; 0.0081361; 0.0086132
3-6: 5; 84; 81.5; 0.1226993; 0.8773006; 0.9438202; 0.0386020; 0.0435729; 0.0244084; 0.0114771; 0.0137495
6-9: 14; 69; 0.2580645; 0.7419354; 0.8280140; 0.0712270
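The life-table rows above can be reproduced with a short sketch. It assumes the usual actuarial conventions (half of the censored cases are removed from the number at risk; density and hazard use the interval width), which agree with the printed values but are not stated in this text. The failure counts 5, 10 and 16 are inferred from the printed proportions, and `life_table` is an illustrative name.

```python
def life_table(rows, width):
    """rows: (entered, censored, failures) for consecutive intervals of equal width."""
    cumulative = 1.0  # cumulative proportion surviving at the start of the interval
    table = []
    for entered, censored, failures in rows:
        at_risk = entered - censored / 2          # censored cases count as half
        q = failures / at_risk                    # proportion of failure events
        p = 1 - q                                 # proportion surviving the interval
        table.append({
            "at_risk": at_risk,
            "prop_failing": q,
            "prop_surviving": p,
            "cum_surviving": cumulative,
            "density": cumulative * q / width,    # probability density
            "hazard": 2 * q / (width * (1 + p)),  # hazard rate
        })
        cumulative *= p
    return table


# Entered / censored taken from the report; failure counts inferred (assumption).
rows = [(89, 0, 5), (84, 5, 10), (69, 14, 16)]
table = life_table(rows, width=3)
for interval in table:
    print({name: round(value, 7) for name, value in interval.items()})
```

Under these assumptions the second interval gives 81.5 at risk and a failure proportion of about 0.1227, and the third interval starts with a cumulative survival of about 0.8280, matching the report.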
82. prearranged collection of data from any datasheet or import data. The amount of data which one datasheet can hold is limited to 4 million rows and 1 thousand columns. No more than 40 characters can be put in each cell. Data import: You can easily import data from: xls/xlsx files; txt/csv files (with UTF-8 or Windows-1250 encoding); shp files (SHP, SHX, DBF: ESRI Shapefile); dbf files (dBase III, dBase IV, dBase VII); dbf files (FoxPro). To perform an import operation, click Import from the menu. 3 WORKING WITH DOCUMENTS. [Window: Import data from XLS/XLSX file. Options: File name, Separator (comma, semicolon), with headers, Preview.] The import window lets you preview the imported data and verify the import results in advance, depending on the way the data are interpreted. To avoid misinterpretation of national characters, pay special attention to the correctness of the characters displayed in the preview window. If the files are huge, the preview window displays only the beginning of the data from the given file. Note: In applications like Microsoft Office Excel 2000-2007 the default character encoding is Windows-1250. Importing data from Microsoft Excel documents covers cell values only. There is no possibility to
83. question and between the 2nd and 3rd questions. The difference arises because the second question is easier than the first and the third ones (the number of correct answers to that question is higher). 13 STRATIFIED ANALYSIS. 13.1 THE MANTEL-HAENSZEL METHOD FOR SEVERAL 2x2 TABLES. The Mantel-Haenszel method for 2 × 2 tables was proposed by Mantel and Haenszel (1959) [56] and then extended by Mantel (1963) [57]. A wider review of the development of these methods was carried out, among others, by Newman (2001) [66]. This method can be used in the analysis of 2 × 2 tables that occur in several (w ≥ 2) strata constructed by a confounding variable. For the successive strata (s = 1, ..., w), 2 × 2 contingency tables of observed frequencies are created. [Table: observed frequencies in stratum s, with the risk factor (exposed, unexposed, total) in rows and the analysed phenomenon (illness) in columns; cells $O_{11}^{(s)}, O_{12}^{(s)}, O_{21}^{(s)}, O_{22}^{(s)}$ and total $n^{(s)}$.] The settings window with the Mantel-Haenszel OR/RR can be opened in Statistics menu → Stratified analysis → Mantel-Haenszel OR/RR. [Mantel-Haenszel OR/RR settings window: homogeneity; fill with saved selection; Test options: Odds Ratio, Relative Risk; Report options: Add analysed data, Add graph.] 13.1.1 The Mantel-Haenszel odds ratio. If all tables created by individual strata are homogeneous (the χ² test of homogeneity for
84. question one can get 1-5 points, where 1 is the lowest mark and 5 is the highest mark. The maximum score of the questionnaire is 35. The table contains the scores obtained by 24 candidates. 20 RELIABILITY ANALYSIS. [Table: scores of the 24 candidates on the items KK1-KK7.] To check the accuracy of the competence scale, its reliability should be analysed. The correlation matrix indicates that the last item is the least correlated with the other items. Thus it is suspected that the item does not measure the same construct as the others. The competence scale turned out to be a reliable scale: the Cronbach alpha coefficient is 0.736805 and the mean of all the Pearson's correlation coefficients is 0.31847. Report columns: Deleted item; Scale mean if item deleted; Scale standard deviation if item deleted; Correlation between deleted item and sum of remaining; Cronbach Alpha if item deleted. Report rows: Group size; Number
85. 20 RELIABILITY ANALYSIS. If two randomly selected halves are ideally correlated, $r_{SH} = 1$. A formula for the split-half reliability coefficient was proposed by Guttman (1945) [36]: $r_{SHG} = \frac{2\,(sd_t^2 - sd_{t_1}^2 - sd_{t_2}^2)}{sd_t^2}$, where $sd_{t_1}^2$, $sd_{t_2}^2$ are the variances of the first and the second half of the scale and $sd_t^2$ is the variance of the sum of all scale items. Note: The scale is reliable if the scale reliability coefficients $\alpha_C$, $\alpha_{standard}$, $r_{SH}$, $r_{SHG}$ are larger than 0.6 and smaller than 1. The standard error of measurement is calculated for a reliable scale according to the following formula: $SEM = sd_t\sqrt{1 - \alpha_C}$ for the Cronbach's alpha coefficient of reliability, or $SEM = sd_t\sqrt{1 - r_{SH}}$ for the split-half reliability coefficient. The settings window with the Cronbach's alpha/Split-half can be opened in Statistics menu → Scale reliability. [Scale Reliability settings window: Cronbach's alpha / Split-half; Data Filter; Report options: Add analysed data, More results, Add graph, Correlation matrix.] EXAMPLE 20.1 (scale pqs file). A competence scale created in a company enables an assessment of the usefulness of future employees. Apart from taking part in a job interview, candidates fill in a questionnaire that includes the competence scale questions. There are 7 questions in the scale. For each
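The reliability formulas above can be checked numerically. The sketch below uses the standard deviations and the correlation between halves printed in the split-half report of Example 20.1. The Spearman-Brown form of $r_{SH}$ is an assumption here (that formula is not legible in this text), and the Guttman form is the standard one; all variable names are illustrative.

```python
import math

# Values as printed in the split-half report of Example 20.1.
sd_total = 5.96305    # standard deviation of the whole scale
sd_half1 = 3.076029   # standard deviation of the first half (KK1, KK3, KK5)
sd_half2 = 3.296628   # standard deviation of the second half (KK2, KK4, KK6, KK7)
r_halves = 0.750862   # correlation between the two halves

# Split-half reliability, Spearman-Brown form (assumed): 2r / (1 + r).
r_sh = 2 * r_halves / (1 + r_halves)

# Guttman split-half reliability: 2 * (var_total - var_half1 - var_half2) / var_total.
r_shg = 2 * (sd_total**2 - sd_half1**2 - sd_half2**2) / sd_total**2

# Standard error of measurement for the split-half coefficient.
sem = sd_total * math.sqrt(1 - r_sh)

print(round(r_sh, 6), round(r_shg, 6), round(sem, 5))
```

Under these assumptions the results agree with the reported split-half reliability (0.857705), Guttman coefficient (0.856531) and standard error of measurement (2.24936) up to rounding of the inputs.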
86. sizes has the χ² distribution with k degrees of freedom. On the basis of the test statistic, the p-value is estimated and then compared with α: if p < α ⟹ we reject H0 and accept H1; if p > α ⟹ there is no reason to reject H0. 19.4.3 Analysis of model residuals. The analysis of the model residuals allows the verification of its assumptions. The main goal of the analysis in Cox regression is the localization of outliers and the study of hazard proportionality. Typically, in regression models residuals are calculated as the differences between the observed and the predicted values of the dependent variable. However, in the case of censored values such a method of determining the residuals is not appropriate. In the program we can analyze residuals described as: Martingale, deviance, and Schoenfeld. The residuals can be drawn with respect to time or to the independent variables. Hazard proportionality assumption: A number of graphical methods for evaluating the goodness of fit of the proportional hazard model have been created (Lee and Wang 2003 [49]). The most widely used are the methods based on the model residuals. As with other graphical methods of evaluating hazard proportionality, this one is subjective. For the assumption of proportional hazard to be fulfilled, the residuals should not form any pattern with respect to time but should be randomly distributed around the value 0. Martingale: the residuals can be interpreted as a differen
87. so it must be estimated from the sample. The values for this sample are $\bar{x} = 3.73$ and $sd = 1.91$. [Report: Analysed variables: waiting days; Significance level: 0.05; Group size: 22; Group mean: 3.727273; Group standard deviation: 1.906925; D statistic; p-value.] 10 COMPARISON - 1 GROUP. [Graph: frequency histogram of waiting days.] The value of the Kolmogorov-Smirnov and the Lilliefors test statistic is exactly the same and amounts to 0.1357, but the p-value is 0.763881 for the Kolmogorov-Smirnov test and 0.364381 for the Lilliefors test. Both tests indicate that, at the significance level α = 0.05, there is no reason to reject the null hypothesis, which states that the analysed data follow a normal distribution. 10.2.2 The Wilcoxon test (signed-ranks). The Wilcoxon signed-ranks test is also known as the Wilcoxon single sample test (Wilcoxon 1945, 1949 [83]). This test is used to verify the hypothesis that the analysed sample comes from a population whose median θ is a given value. Basic assumptions: measurement on an ordinal scale or on an interval scale. Hypotheses: $H_0: \theta = \theta_0$, $H_1: \theta \neq \theta_0$. 10 C
88. test RxC. [Settings window: Variable 2; Data Filter; Report options: Add analysed data, Add percentages (Rows).] EXAMPLE 11.6 (country education pqs file). There is a sample of 605 persons (n = 605) for whom 2 features were analysed: X = country of residence, Y = education. The first feature occurs in 4 categories ($X_1$ = Country 1, $X_2$ = Country 2, $X_3$ = Country 3, $X_4$ = Country 4) and the second one in 3 categories ($Y_1$ = primary, $Y_2$ = secondary, $Y_3$ = higher). The data distribution is shown below in the contingency table (values as printed):
 | Primary | Secondary | Higher
Country 1 | 50 | 56 | 43
Country 2 | 51 | 70 | 33
Country 3 | 5 | 45 | 5
Country 4 | 65 | 45 | 40
Based on this sample you would like to find out if there is any dependence between education and country of residence in the analysed population. Hypotheses: H0: there is no dependence between education and country of residence in the analysed population; H1: there is a dependence between education and country of residence in the analysed population. 11 COMPARISON - 2 GROUPS. Report: Analysed variables: Country, Education; Significance level: 0.05; Chi-square statistic: 13.817861; Degrees of freedom: 6; p-value: 0.031738. [Graph: Country 1, Country 2, Country 3, Country 4.] The table of the expec
89. [Project Manager report list: Mann-Whitney U test (2012-12-9 21:02:55), Logistic Regression (2012-12-9 21:03:14), each with a Note field.] 3 WORKING WITH DOCUMENTS. The most important element in each project is a datasheet. Each open project must contain at least one datasheet. 3.1 HOW TO WORK WITH DATASHEETS. 3.1.1 HOW TO ADD, DELETE AND EXPORT DATASHEETS. The first empty datasheet is opened automatically together with a new project. Further datasheets can be added to the project by: File menu → Add datasheet (Ctrl+D); the button on the toolbar; the Add datasheet button in the Project Manager. You can delete a datasheet by: context menu Delete sheet (Shift+Del) on the name of the datasheet in the Navigation Tree; the Delete button in the Project Manager for the selected sheet or sheets. Remember, however, that if any reports or maps are attached to a datasheet and you delete that datasheet, all reports and maps attached to it will be deleted too. Datasheets can be described in the Project Manager by adding a name, a title or a note. All datasheets created in PQStat can be exported to the csv, txt, dbf and xls formats. You can do this by clicking the Export to button in the Project Manager for the selected sheet or sheets. 3.1.2 HOW TO INSERT DATA INTO A SHEET. A newly created datasheet is empty. You can insert some data, copy
90. then select the option 1 - p value. Confirm all the chosen settings by clicking Calculate. N(161.15, 13.03), statistic 150. The obtained p-value is 0.806039. 3. To calculate, using the normal (Gauss) distribution, the probability that the client you have chosen used a number of free minutes within the range $(\bar{x} - sd, \bar{x} + sd)$ = (148.12 min, 174.18 min), put one of the end values of the range in the Statistic field and then select the option two-sided. Confirm all the chosen settings by clicking Calculate. N(161.15, 13.03), range 148.12 to 174.18. The obtained p-value is 0.682689. 4. To calculate, using the normal (Gauss) distribution, the probability that the client you have chosen used a number of free minutes outside the range $(\bar{x} - sd, \bar{x} + sd)$ = (148.12 min, 174.18 min), put one of the end values of the range in the Statistic field and then select the options two-sided and 1 - p value. Confirm all the chosen settings by clicking Calculate. N(161.15, 13.03), range 148.12 to 174.18. 8 PROBABILITY DISTRIBUTIONS. The obtained p-value is 0.317311. 9 HYPOTHESES TESTING. The process of generalising the results obtained from the sample to the whole population is divided into 2 basic parts: estimation (estimating the values of the parameters of the population on the basis of the statistical sample); v
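The two range probabilities in points 3 and 4 above can be reproduced from the normal cumulative distribution function. This is a minimal sketch using only the Python standard library; `normal_cdf` is an illustrative helper, not a PQStat function.

```python
import math


def normal_cdf(x, mean, sd):
    """Cumulative distribution function of N(mean, sd), written with the error function."""
    return 0.5 * (1 + math.erf((x - mean) / (sd * math.sqrt(2))))


mean, sd = 161.15, 13.03          # parameters used in the example
lower, upper = 148.12, 174.18     # mean - sd and mean + sd

# Probability of a value inside the range (the "two-sided" option).
inside = normal_cdf(upper, mean, sd) - normal_cdf(lower, mean, sd)
# Probability of a value outside the range ("two-sided" together with "1 - p value").
outside = 1 - inside

print(round(inside, 6), round(outside, 6))
```

Because the range is exactly one standard deviation on either side of the mean, the first probability is the familiar 68.27% and the two results sum to 1.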
91. the OR can check this condition), then on the basis of these tables the pooled odds ratio with its confidence interval can be designated. Such an odds ratio is a weighted mean of the odds ratios designated for the individual strata. The weighting method proposed by Mantel and Haenszel makes it possible to include the contribution of the strata weights. Each stratum has an influence on the pooled odds ratio: the greater the size of the stratum, the greater its weight and the greater its influence on the pooled odds ratio. 13 STRATIFIED ANALYSIS. Weights for individual strata are designated according to the formula $\frac{O_{12}^{(s)} O_{21}^{(s)}}{n^{(s)}}$, and the Mantel-Haenszel odds ratio is $OR_{MH} = \frac{R}{S}$, where $R = \sum_{s=1}^{w} \frac{O_{11}^{(s)} O_{22}^{(s)}}{n^{(s)}}$ and $S = \sum_{s=1}^{w} \frac{O_{12}^{(s)} O_{21}^{(s)}}{n^{(s)}}$. The confidence interval for $\ln OR_{MH}$ is designated on the basis of the standard error (RGB: Robins-Breslow-Greenland [70][71]) calculated according to the formula $SE_{MH} = \sqrt{\frac{T}{2R^2} + \frac{U + V}{2RS} + \frac{W}{2S^2}}$, where, with $P^{(s)} = \frac{O_{11}^{(s)} + O_{22}^{(s)}}{n^{(s)}}$ and $Q^{(s)} = \frac{O_{12}^{(s)} + O_{21}^{(s)}}{n^{(s)}}$, $T = \sum_{s=1}^{w} P^{(s)} \frac{O_{11}^{(s)} O_{22}^{(s)}}{n^{(s)}}$, $U = \sum_{s=1}^{w} P^{(s)} \frac{O_{12}^{(s)} O_{21}^{(s)}}{n^{(s)}}$, $V = \sum_{s=1}^{w} Q^{(s)} \frac{O_{11}^{(s)} O_{22}^{(s)}}{n^{(s)}}$, $W = \sum_{s=1}^{w} Q^{(s)} \frac{O_{12}^{(s)} O_{21}^{(s)}}{n^{(s)}}$. The Mantel-Haenszel χ² test for the $OR_{MH}$: The Mantel-Haenszel Chi-square test for the $OR_{MH}$ is used to verify the hypothesis about the significance of the designated odds ratio ($OR_{MH}$). It should be calculated for large frequencies, i.e. when both conditions of t
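The pooled odds ratio described above can be sketched as follows. The two strata are hypothetical, and each is given as a tuple `(O11, O12, O21, O22)`; `mantel_haenszel_or` is an illustrative helper name.

```python
def mantel_haenszel_or(strata):
    """Pooled odds ratio R / S over 2 x 2 tables given as (O11, O12, O21, O22)."""
    r = sum(o11 * o22 / (o11 + o12 + o21 + o22) for o11, o12, o21, o22 in strata)
    s = sum(o12 * o21 / (o11 + o12 + o21 + o22) for o11, o12, o21, o22 in strata)
    return r / s


# Hypothetical strata (exposed ill, exposed healthy, unexposed ill, unexposed healthy).
strata = [(10, 20, 5, 40), (15, 10, 10, 20)]
or_mh = mantel_haenszel_or(strata)

# The same value obtained as a weighted mean of the stratum odds ratios,
# with weights O12 * O21 / n.
weights = [b * c / (a + b + c + d) for a, b, c, d in strata]
stratum_ors = [(a * d) / (b * c) for a, b, c, d in strata]
weighted_mean = sum(w * o for w, o in zip(weights, stratum_ors)) / sum(weights)

print(round(or_mh, 4), round(weighted_mean, 4))
```

The stratum odds ratios here are 4.0 and 3.0, and the pooled value lies between them, closer to the stratum with the larger weight.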
92. the correction for ties. This correction is used when ties occur (if there are no ties, the correction is not calculated, because C = 0). Note: $W$ is the Kendall's coefficient in a population; $\widetilde{W}$ is the Kendall's coefficient in a sample. The value of $W \in \langle 0; 1 \rangle$ and it should be interpreted in the following way: $\widetilde{W} \approx 1$ means a strong concordance in the judges' assessments; $\widetilde{W} \approx 0$ means a lack of concordance in the judges' assessments. The Kendall's W coefficient of concordance vs. the Spearman $r_s$ coefficient: When the values of the Spearman $r_s$ correlation coefficient are calculated for all possible pairs, the average $r_s$ coefficient, marked by $\bar{r}_s$, is a linear function of the $\widetilde{W}$ coefficient: $\bar{r}_s = \frac{n\widetilde{W} - 1}{n - 1}$. The Kendall's W coefficient of concordance vs. the Friedman ANOVA: The Kendall's $\widetilde{W}$ coefficient of concordance and the Friedman ANOVA are based on the same mathematical model. As a result, the value of the chi-square test statistic for the Kendall's coefficient of concordance and the value of the chi-square test statistic for the Friedman ANOVA are the same. The chi-square test of significance for the Kendall's coefficient of concordance. Basic assumptions: measurement on an ordinal scale or on an interval scale. Hypotheses: $H_0: W = 0$, $H_1: W \neq 0$. 15 AGREEMENT ANALYSIS. The test statistic is defined by $\chi^2 = n(k - 1)\widetilde{W}$. This statistic as
93. the delivery which is supposed to be delivered by the above-mentioned courier company is different from 3. 10 COMPARISON - 1 GROUP. Report: Analysed variables: waiting days; Significance level: 0.05; Group size: 22; Hypothetical mean: 3; Group mean: 3.727273; Std err of the group mean: 0.406558; Group standard deviation: 1.906925; -95% CI for the group mean: 2.881789; +95% CI for the group mean: 4.572756; Difference of the means: 0.727273; -95% CI for the difference: -0.118211; +95% CI for the difference; t statistic; Degrees of freedom; p-value: 0.088074. [Graph: mean, 95% CI and standard deviation of waiting days.] Comparing the p-value = 0.088074 of the t-test with the significance level α = 0.05, we conclude that there is no reason to reject the null hypothesis, which states that the average waiting time for a delivery by the analysed courier company is 3. For the tested sample the mean is $\bar{x} = 3.727$ and the standard deviation is $sd = 1.907$. 10.2 NONPARAMETRIC TESTS. Ranks: consecutive numbers (usually natural ones) ascribed to the values of the ordered measurements of the analysed variable. They are usually used in nonparametric tests which are based only upon the order of elements in the sample. Replacing a
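The standard error and the test statistic behind this report can be recomputed from the printed summary values. The sketch below uses only the standard library, so it stops at the t statistic and the degrees of freedom; the p-value would additionally require the t distribution, which is not computed here.

```python
import math

# Summary values as printed in the report for "waiting days".
n = 22
group_mean = 3.727273
group_sd = 1.906925
hypothetical_mean = 3

# Standard error of the group mean: sd / sqrt(n).
std_err = group_sd / math.sqrt(n)
# One-sample t statistic: (mean - hypothetical mean) / standard error.
t_statistic = (group_mean - hypothetical_mean) / std_err
degrees_of_freedom = n - 1

print(round(std_err, 6), round(t_statistic, 4), degrees_of_freedom)
```

The standard error reproduces the reported 0.406558, and the statistic of about 1.79 with 21 degrees of freedom is consistent with a two-sided p-value above 0.05.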
94. the differences between all the pairs of measurements are pretty close to each other. This test is used to verify the hypothesis determining the equality of means of an analysed variable in several (k ≥ 2) populations. Basic assumptions: measurement on an interval scale; the normal distribution for all variables which are the differences of measurement pairs (or the normal distribution for an analysed variable in each measurement); a dependent model. Hypotheses: $H_0: \mu_1 = \mu_2 = \ldots = \mu_k$; $H_1$: not all $\mu_j$ are equal ($j = 1, 2, \ldots, k$), where $\mu_1, \mu_2, \ldots, \mu_k$ are the means of the analysed feature in the successive measurements from the examined population. The test statistic is defined by $F = \frac{MS_{BC}}{MS_{res}}$, 12 COMPARISON - MORE THAN 2 GROUPS, where: $MS_{BC} = \frac{SS_{BC}}{df_{BC}}$ is the mean square between conditions; $MS_{res} = \frac{SS_{res}}{df_{res}}$ is the mean square residual; $SS_{BC} = \sum_{j=1}^{k} \frac{\left(\sum_{i=1}^{n} x_{ij}\right)^2}{n} - \frac{\left(\sum_{j=1}^{k}\sum_{i=1}^{n} x_{ij}\right)^2}{N}$ is the between-conditions sum of squares; $SS_{res} = SS_T - SS_{BS} - SS_{BC}$ is the residual sum of squares; $SS_T = \sum_{j=1}^{k}\sum_{i=1}^{n} x_{ij}^2 - \frac{\left(\sum_{j=1}^{k}\sum_{i=1}^{n} x_{ij}\right)^2}{N}$ is the total sum of squares; $SS_{BS} = \sum_{i=1}^{n} \frac{\left(\sum_{j=1}^{k} x_{ij}\right)^2}{k} - \frac{\left(\sum_{j=1}^{k}\sum_{i=1}^{n} x_{ij}\right)^2}{N}$ is the between-subjects sum of squares; $df_{BC} = k - 1$ is the between-conditions degrees of freedom; $df_{res} = df_T - df_{BC} - df_{BS}$ is the residual degrees of freedom; $df_T = N - 1$ total
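The sums of squares listed above can be traced on a small data set. The sketch below is an illustration with hypothetical data (3 subjects measured under 3 conditions); it assumes $N = nk$ and $df_{BS} = n - 1$, which the text is cut off before stating, and `rm_anova_f` is an illustrative name.

```python
def rm_anova_f(data):
    """F statistic of the repeated-measures ANOVA; data[i][j] = subject i, condition j."""
    n, k = len(data), len(data[0])
    big_n = n * k                                   # total number of measurements
    correction = sum(sum(row) for row in data) ** 2 / big_n
    ss_total = sum(x * x for row in data for x in row) - correction
    # Between conditions: squared column sums divided by the number of subjects.
    ss_bc = sum(sum(row[j] for row in data) ** 2 for j in range(k)) / n - correction
    # Between subjects: squared row sums divided by the number of conditions.
    ss_bs = sum(sum(row) ** 2 for row in data) / k - correction
    ss_res = ss_total - ss_bs - ss_bc
    df_bc = k - 1
    df_bs = n - 1                                   # assumption, see lead-in
    df_res = (big_n - 1) - df_bc - df_bs
    f_statistic = (ss_bc / df_bc) / (ss_res / df_res)
    return f_statistic, df_bc, df_res


# Hypothetical measurements: rows are subjects, columns are conditions.
data = [[1, 2, 4],
        [2, 3, 4],
        [3, 5, 5]]
f_statistic, df_bc, df_res = rm_anova_f(data)
print(round(f_statistic, 4), df_bc, df_res)
```

For these data the between-conditions mean square is about 4.11 and the residual mean square about 0.28, so F is 14.8 with 2 and 4 degrees of freedom.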
95. the hazard ratio $HR = \frac{O_1 / E_1}{O_2 / E_2}$. If the hazard ratio is greater than 1, e.g. HR = 2, then the risk of a failure event in the first group is twice as big as in the second group. The reverse situation takes place when HR is smaller than one. When HR is equal to 1, both groups are equally at risk. Note: The confidence interval for HR is calculated on the basis of the standard deviation of the HR logarithm (Armitage and Berry 1994 [5]). 19 SURVIVAL ANALYSIS. 19.3.2 Survival curve trend. Hypotheses: H0: in the studied population there is no trend in the placement of the $S_1, S_2, \ldots, S_k$ curves; H1: in the studied population there is a trend in the placement of the $S_1, S_2, \ldots, S_k$ curves. In the calculation the chi-square statistic was used in the following form: $\chi^2 = \frac{(c^T U)^2}{c^T V c}$, where $c = (c_1, c_2, \ldots, c_k)$ is the vector of the weights for the compared groups, informing about their natural order (usually the subsequent natural numbers). The statistic asymptotically (for large sizes) has the χ² distribution with 1 degree of freedom. On the basis of the test statistic, the p-value is estimated and then compared with the significance level α: if p < α ⟹ we reject H0 and accept H1; if p > α ⟹ there is no reason to reject H0. In order to conduct a trend analysis in the survival curves, the grouping variable must be a numerical variable in which
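The hazard ratio formula above reduces to simple arithmetic on the observed and expected numbers of failure events. The counts in this sketch are hypothetical and `hazard_ratio` is an illustrative helper name.

```python
def hazard_ratio(observed_1, expected_1, observed_2, expected_2):
    """HR = (O1 / E1) / (O2 / E2) for two compared groups."""
    return (observed_1 / expected_1) / (observed_2 / expected_2)


# Hypothetical observed and expected failure events in two groups.
hr = hazard_ratio(observed_1=30, expected_1=20, observed_2=20, expected_2=30)
print(hr)

# Swapping the groups gives the reciprocal, i.e. the "reverse situation".
hr_reversed = hazard_ratio(20, 30, 30, 20)
print(round(hr_reversed, 4))
```

With these counts the first group has more failures than expected and the second fewer, so the ratio is above 1; swapping the groups gives its reciprocal.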
96. the independent variables should not be multicollinear. In the case of multicollinearity, estimation can be uncertain and the obtained error values very high. The multicollinear variables should be removed from the model, or one independent variable should be built from them, e.g. instead of the multicollinear variables of mother age and father age one can build the parents age variable. 17 MULTIDIMENSIONAL MODELS. Note: The criterion of convergence of the Newton-Raphson iterative algorithm can be controlled with the help of two parameters: the limit of convergence iteration (it gives the maximum number of iterations in which the algorithm should reach convergence) and the convergence criterion (it gives the value below which the received improvement of estimation is considered insignificant and the algorithm stops). 17.4.1 Odds Ratio. Individual Odds Ratio: On the basis of the coefficients, an easily interpreted measure is estimated for each independent variable in the model, i.e. the individual Odds Ratio: $OR_i = e^{\beta_i}$. The received Odds Ratio expresses the change of the odds for the occurrence of the distinguished value (1) when the independent variable grows by 1 unit. The result is corrected for the remaining independent variables in the model, so it is assumed that they remain at a stable level while the studied variable is
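The individual odds ratio above is the exponential of the estimated coefficient. The sketch below uses a hypothetical coefficient and standard error; the Wald-type interval with the normal quantile 1.96 is an assumption about how the confidence limits are built, not something stated in this text.

```python
import math

# Hypothetical logistic regression estimate for one independent variable.
coefficient = 0.5      # estimated beta
std_error = 0.2        # standard error of beta
z = 1.96               # normal quantile for a 95% interval (assumed Wald interval)

# Individual odds ratio: change of the odds when the variable grows by 1 unit.
odds_ratio = math.exp(coefficient)

# Confidence limits are obtained on the coefficient scale and then exponentiated.
ci_lower = math.exp(coefficient - z * std_error)
ci_upper = math.exp(coefficient + z * std_error)

print(round(odds_ratio, 4), round(ci_lower, 4), round(ci_upper, 4))
```

An odds ratio of about 1.65 means the odds of the distinguished value rise by roughly 65% per unit of the variable; since the lower limit here is above 1, such an effect would be significant at the 5% level under the stated assumption.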
97. the samples chosen randomly from the 1st and the 2nd population. The test statistic has the F Snedecor distribution with $n_1 - 1$ and $n_2 - 1$ degrees of freedom. The p-value, designated on the basis of the test statistic, is compared with the significance level α: if p < α ⟹ reject H0 and accept H1; if p > α ⟹ there is no reason to reject H0. The settings window with the Fisher-Snedecor test can be opened in Statistics menu → Parametric tests → F Fisher Snedecor. [F Fisher Snedecor settings window: Test options; Use the grouping variable; Data Filter; Add analysed data; Add graph.] Note: Calculations can be based on raw data or on averaged data such as arithmetic means, standard deviations and sample sizes. 11.1.2 The t-test for independent groups. The t-test for independent groups is used to verify the hypothesis about the equality of means of an analysed variable in 2 populations. Basic assumptions: measurement on an interval scale; normality of distribution of an analysed feature in both populations; an independent model; equality of variances of an analysed variable in 2 populations. Hypotheses: $H_0: \mu_1 = \mu_2$, $H_1: \mu_1 \neq \mu_2$, where $\mu_1$, $\mu_2$ are the means of an analysed
98. to the number of the estimated parameters of the model: $n > k + 1$. 17.2.1 Model verification. Statistical significance of particular variables in the model: On the basis of the coefficient and its error of estimation we can infer whether the independent variable for which the coefficient was estimated has a significant effect on the dependent variable. For that purpose we use the t-test. Hypotheses: $H_0: \beta_i = 0$, $H_1: \beta_i \ne 0$. The test statistic is estimated according to the formula $t = \frac{b_i}{SE_{b_i}}$. The test statistic has the t-Student distribution with $n - k$ degrees of freedom. The p-value, designated on the basis of the test statistic, is compared with the significance level $\alpha$: if $p \le \alpha$, reject $H_0$ and accept $H_1$; if $p > \alpha$, there is no reason to reject $H_0$. The quality of the constructed model of multiple linear regression can be evaluated with the help of several measures. The standard error of estimation is the measure of model adequacy: $SE_e = \sqrt{\frac{\sum_{i=1}^{n} e_i^2}{n - (k + 1)}}$. The measure is based on model residuals $e_i = y_i - \hat{y}_i$, that is on the discrepancy between the actual values of the dependent variable $y_i$ in the sample and the values of the dependent variable $\hat{y}_i$ estimated on the basis of the constructed model. It would be best if the difference were as close to zero as possible for all studied properties of th
99. to the time;
minute(v1): returns the minutes ascribed to the time;
second(v1): returns the seconds ascribed to the time;
yeardiff(v1, v2): returns the difference in years between two dates;
monthdiff(v1, v2): returns the difference in months between two dates;
weekdiff(v1, v2): returns the difference in weeks between two dates;
daydiff(v1, v2): returns the difference in days between two dates;
hourdiff(v1, v2): returns the difference in hours between two times;
minutediff(v1, v2): returns the difference in minutes between two times;
seconddiff(v1, v2): returns the difference in seconds between two times;
compdate(v1, v2): compares two dates and returns 1 when v1 > v2, 0 if v1 = v2, -1 if v1 < v2.
Logical functions:
if(question, yes answer, no answer): the question has the form of a statement which can be true or false. The function returns one value if the statement is true and another value if it is false;
and: conjunction operator, returns the truth (1) when all the conditions it connects are true, otherwise it returns falsity (0);
or: alternative operator, returns the truth (1) when at least one of the conditions it connects is true, otherwise it returns falsity (0);
xor: either-or operator, returns the truth (1) when one of the conditions it connects is true, otherwise it returns falsity (0);
n
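The return conventions of `compdate` and of the logical operators in extract 99 can be mirrored in a few lines. This is a sketch of the described behaviour only, not PQStat formula syntax; the Python function names are hypothetical, and reading `xor` as "exactly one condition is true" is an assumption.

```python
from datetime import date

def compdate(v1, v2):
    """Return 1 when v1 > v2, 0 when v1 == v2, -1 when v1 < v2."""
    return (v1 > v2) - (v1 < v2)

def logical_if(question, yes_answer, no_answer):
    """Return yes_answer when the statement is true, otherwise no_answer."""
    return yes_answer if question else no_answer

def logical_and(*conditions):
    """1 when all conditions are true, otherwise 0."""
    return int(all(conditions))

def logical_or(*conditions):
    """1 when at least one condition is true, otherwise 0."""
    return int(any(conditions))

def logical_xor(*conditions):
    """1 when exactly one condition is true (assumed reading), otherwise 0."""
    return int(sum(bool(c) for c in conditions) == 1)

later, earlier = date(2014, 5, 2), date(2014, 5, 1)
print(compdate(later, earlier), compdate(earlier, earlier), compdate(earlier, later))
print(logical_if(compdate(later, earlier) == 1, "after", "not after"))
```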
100. using the logical AND (basic, multiple). Report options: Add analysed data, More results, Add graph. EXAMPLE 12.1 (age ANOVA.pqs file). There are 150 persons chosen randomly from the population of workers of 3 different transport companies. From each company 50 persons are drawn to the sample. Before the experiment begins, you should check whether the average age of the workers of these companies is similar, because the next step of the experiment depends on it. The age of each participant is written in years. Age (company 1): 27, 33, 25, 32, 34, 38, 31, 34, 20, 30, 30, 27, 34, 32, 33, 25, 40, 35, 29, 20, 18, 28, 26, 22, 24, 24, 25, 28, 32, 32, 33, 32, 34, 27, 34, 27, 35, 28, 35, 34, 28, 29, 38, 26, 36, 31, 25, 35, 41, 37. Age (company 2): 38, 34, 33, 27, 36, 20, 37, 40, 27, 26, 40, 44, 36, 32, 26, 34, 27, 31, 36, 36, 25, 40, 27, 30, 36, 29, 32, 41, 49, 24, 36, 38, 18, 33, 30, 28, 27, 26, 42, 34, 24, 32, 36, 30, 37, 34, 33, 30, 44, 29. Age (company 3): 34, 36, 31, 37, 45, 39, 36, 34, 39, 27, 35, 33, 36, 28, 38, 25, 29, 26, 45, 28, 27, 32, 33, 30, 39, 40, 36, 33, 28, 32, 36, 39, 32, 39, 37, 35, 44, 34, 21, 42, 40, 32, 30, 23, 32, 34, 27, 39, 37, 35. Before you do this example, it is worth starting with the similar task, but related to 2 groups only (11.7). Hypotheses: $H_0$: the average age of the workers of all the analysed transport companies is the same; $H_1$: at least 2 means are different.
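For orientation, the one-way ANOVA F statistic for the three age samples of Example 12.1 can be computed by hand. The sketch below uses the standard between-group and within-group decomposition; it does not reproduce the PQStat report, and no p-value is computed because the standard library has no F distribution. The function name `one_way_anova` is hypothetical.

```python
from statistics import mean

company_1 = [27, 33, 25, 32, 34, 38, 31, 34, 20, 30, 30, 27, 34, 32, 33, 25, 40,
             35, 29, 20, 18, 28, 26, 22, 24, 24, 25, 28, 32, 32, 33, 32, 34, 27,
             34, 27, 35, 28, 35, 34, 28, 29, 38, 26, 36, 31, 25, 35, 41, 37]
company_2 = [38, 34, 33, 27, 36, 20, 37, 40, 27, 26, 40, 44, 36, 32, 26, 34, 27,
             31, 36, 36, 25, 40, 27, 30, 36, 29, 32, 41, 49, 24, 36, 38, 18, 33,
             30, 28, 27, 26, 42, 34, 24, 32, 36, 30, 37, 34, 33, 30, 44, 29]
company_3 = [34, 36, 31, 37, 45, 39, 36, 34, 39, 27, 35, 33, 36, 28, 38, 25, 29,
             26, 45, 28, 27, 32, 33, 30, 39, 40, 36, 33, 28, 32, 36, 39, 32, 39,
             37, 35, 44, 34, 21, 42, 40, 32, 30, 23, 32, 34, 27, 39, 37, 35]

def one_way_anova(groups):
    """Return (F, df_between, df_within, ss_between, ss_within)."""
    all_values = [v for g in groups for v in g]
    grand_mean = mean(all_values)
    group_means = [mean(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, group_means))
    ss_within = sum((v - m) ** 2
                    for g, m in zip(groups, group_means) for v in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within, ss_between, ss_within

groups = [company_1, company_2, company_3]
f_stat, df_b, df_w, ss_b, ss_w = one_way_anova(groups)
print(f"F({df_b}, {df_w}) = {f_stat:.4f}")
```

The resulting F value would then be compared with the F distribution with 2 and 147 degrees of freedom to obtain the p-value.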
101. [Report table and chart residue for Quarters I to IV; values not recoverable.] Comparing the p-value (p = 0.000026) with the significance level $\alpha = 0.05$, we state that the chocolate bar sale is not the same in each quarter. The POST-HOC analysis indicates differences in the sale between quarters I and III and between quarters I and IV.
POST-HOC report matrices for Quarters I to IV, read pair by pair (column roles are inferred from the layout of the garbled matrices):
Quarter I vs Quarter II: difference 1.07143, test statistic 2.19578, p-value 0.16865
Quarter I vs Quarter III: difference 2.14286, test statistic 4.39155, p-value 0.00007
Quarter I vs Quarter IV: difference 1.92857, test statistic 3.9524, p-value 0.00046
Quarter II vs Quarter III: difference 1.07143, test statistic 2.19578, p-value 0.16865
Quarter II vs Quarter IV: difference 0.85714, test statistic 1.75662, p-value 0.4739
Quarter III vs Quarter IV: difference 0.21429, test statistic 0.43916, p-value 1
The value 1.28734 is shown for every pair in a separate matrix.
12.2.3 The Chi-square test for multidimensional contingency tables. The $\chi^2$ test for multidimensional contingency tables is an extension of the $\chi^2$ test for R x C tables to more than two features. Basic assumptions: mea
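The post-hoc p-values legible in extract 101 appear to be consistent with two-sided normal p-values multiplied by the number of pairwise comparisons (a Bonferroni-type correction) and capped at 1. That reading is an assumption checked numerically below, not a statement of the method documented in the guide; the test statistics are the six values legible in the garbled report.

```python
import math

def two_sided_normal_p(z):
    """Two-sided p-value of a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2.0))

# Test statistics as read from the garbled POST-HOC matrix.
statistics_by_pair = {
    "I-II": 2.19578,
    "I-III": 4.39155,
    "I-IV": 3.9524,
    "II-III": 2.19578,
    "II-IV": 1.75662,
    "III-IV": 0.43916,
}

n_comparisons = len(statistics_by_pair)  # 6 pairs of quarters

# Assumed correction: multiply by the number of comparisons, cap at 1.
adjusted_p = {
    pair: min(1.0, n_comparisons * two_sided_normal_p(z))
    for pair, z in statistics_by_pair.items()
}

for pair, p in adjusted_p.items():
    flag = "significant" if p <= 0.05 else "not significant"
    print(f"{pair}: adjusted p = {p:.5f} ({flag})")
```

Under this assumption only the pairs I-III and I-IV fall below 0.05, which agrees with the conclusion stated in the extract.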
102. variable of the 1st and the 2nd population. The test statistic is defined by $t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{(n_1 - 1) sd_1^2 + (n_2 - 1) sd_2^2}{n_1 + n_2 - 2} \left( \frac{1}{n_1} + \frac{1}{n_2} \right)}}$, where $\bar{x}_1$, $\bar{x}_2$ are the means of an analysed variable of the 1st and the 2nd sample, $n_1$, $n_2$ the 1st and the 2nd sample size, and $sd_1^2$, $sd_2^2$ the variances of an analysed variable of the 1st and the 2nd sample. The test statistic has the t-Student distribution with $df = n_1 + n_2 - 2$ degrees of freedom. The p-value, designated on the basis of the test statistic, is compared with the significance level $\alpha$: if $p \le \alpha$, reject $H_0$ and accept $H_1$; if $p > \alpha$, there is no reason to reject $H_0$. Note: the pooled standard deviation is defined by $SD_p = \sqrt{\frac{(n_1 - 1) sd_1^2 + (n_2 - 1) sd_2^2}{n_1 + n_2 - 2}}$; the standard error of the difference of means is defined by $SE_{\bar{x}_1 - \bar{x}_2} = \sqrt{\frac{(n_1 - 1) sd_1^2 + (n_2 - 1) sd_2^2}{n_1 + n_2 - 2} \left( \frac{1}{n_1} + \frac{1}{n_2} \right)}$. 11.1.3 The t-test with the Cochran-Cox adjustment. The Cochran-Cox adjustment relates to the t-test for independent groups (1957) [21] and is calculated when the variances of the analysed variables in both populations are different. The test statistic is defined by $t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{sd_1^2}{n_1} + \frac{sd_2^2}{n_2}}}$. The test statistic has the t-Student distribution with degrees of freedom proposed by Satterthwaite (1946) [73] and calculated using the formula $df = \frac{\left( \frac{sd_1^2}{n_1} + \frac{sd_2^2}{n_2} \right)^2}{\left( \frac{sd_1^2}{n_1} \right)^2 \frac{1}{n_1 - 1} + \left( \frac{sd_2^2}{n_2} \right)^2 \frac{1}{n_2 - 1}}$
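The two test statistics of extract 102 can be sketched as follows, assuming the standard textbook forms of the pooled-variance t statistic and of the Cochran-Cox statistic with Satterthwaite degrees of freedom. The function names and the two small samples are hypothetical; p-values are left out because the standard library has no t distribution.

```python
import math
from statistics import mean, variance

def pooled_t(a, b):
    """t statistic and df of the t-test for independent groups (equal variances)."""
    n1, n2 = len(a), len(b)
    pooled_var = ((n1 - 1) * variance(a) + (n2 - 1) * variance(b)) / (n1 + n2 - 2)
    se = math.sqrt(pooled_var * (1.0 / n1 + 1.0 / n2))
    return (mean(a) - mean(b)) / se, n1 + n2 - 2

def cochran_cox_t(a, b):
    """t statistic with Satterthwaite degrees of freedom (unequal variances)."""
    n1, n2 = len(a), len(b)
    v1, v2 = variance(a) / n1, variance(b) / n2
    t = (mean(a) - mean(b)) / math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df

# Hypothetical samples with visibly different variances.
group_a = [1, 2, 3, 4, 5]
group_b = [2, 4, 6, 8, 10]

t_pooled, df_pooled = pooled_t(group_a, group_b)
t_cc, df_cc = cochran_cox_t(group_a, group_b)
print(f"pooled:       t = {t_pooled:.4f}, df = {df_pooled}")
print(f"Cochran-Cox:  t = {t_cc:.4f}, df = {df_cc:.4f}")
```

With equal sample sizes the two statistics coincide and only the degrees of freedom differ, which is why the adjustment matters most when both the variances and the group sizes are unequal.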
103. variables in the model. In the case of removing only one variable, the results of both tests are identical. If the difference between the compared models is statistically significant (the value $p \le \alpha$), the full model is significantly better than the reduced model. It means that the studied variable is not superfluous: it has a significant effect on the given model and should not be removed from it. Scatter plots: The charts allow a subjective evaluation of the linearity of the relation among the variables and an identification of outliers. Additionally, scatter plots can be useful in an analysis of model residuals. 17.2.3 Analysis of model residuals. To obtain a correct regression model, we should check the basic assumptions concerning model residuals. Outliers: The study of the model residuals can be a quick source of knowledge about outlier values. Such observations can disturb the regression equation to a large extent because they have a great effect on the values of the coefficients in the equation. If a given residual $e_i$ deviates by more than 3 standard deviations from the mean value, such an observation can be classified as an outlier. A removal of an outlier can greatly enhance the model. Normality of distribution of model residuals: The assumption is checked with the help of the Lilliefors test. A big difference between the residu
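The 3 standard deviations rule for residuals from extract 103 is easy to express in code. The sketch below uses hypothetical residuals and the sample standard deviation; whether the program uses the sample or the population form is not stated in the extract, so that choice is an assumption.

```python
from statistics import mean, stdev

def find_outliers(residuals, threshold=3.0):
    """Return indices of residuals lying more than threshold SDs from the mean."""
    centre = mean(residuals)
    spread = stdev(residuals)  # sample standard deviation (assumed)
    return [i for i, e in enumerate(residuals)
            if abs(e - centre) > threshold * spread]

# Hypothetical residuals: twenty small values and one large one.
residuals = [1.0, -1.0] * 10 + [12.0]
outliers = find_outliers(residuals)
print("outlier positions (0-based):", outliers)
```

Note that a single extreme value inflates the standard deviation itself, so in very small samples no residual can exceed the 3 SD limit; the rule is more informative for samples of a few dozen observations or more.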
104. [Table: dummy coding of the education variable with the columns vocational, secondary and tertiary for 16 persons (3 with elementary, 4 with vocational, 4 with secondary and 5 with tertiary education); the 0/1 cell values are garbled in extraction.] Building a multiple regression model on the basis of dummy variables, we might want to check what impact the variables have on a dependent variable, e.g. Y = the amount of earnings (in thousands of PLN). As a result of such an analysis we will obtain sample coefficients for each dummy variable: for sex, the statistically significant coefficient $b_i = -0.5$, which means that average women's wages are half a thousand PLN lower than men's wages, assuming that all other variables in the model remain unchanged; for vocational education, the statistically significant coefficient $b_i = 0.6$, which means that the average wages of people with vocational education are 0.6 of a thousand PLN higher than those of people with elementary education, assuming that all other variables in the model remain unchanged; for secondary education, the statistically significant coefficient $b_i = 1$, which means that the average wages of people with secondary education are a thousand PLN higher than those of people with elementary education, assuming that all other variables in the model remain unchanged
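Dummy coding with a reference category, as used for the education variable in extract 104, can be sketched like this. Elementary education is taken as the reference category because the coefficients in the extract are interpreted relative to it; the function name `dummy_code` and the four sample records are hypothetical.

```python
def dummy_code(values, categories, reference):
    """Code a categorical variable as 0/1 columns, omitting the reference category."""
    columns = [c for c in categories if c != reference]
    return [{c: int(v == c) for c in columns} for v in values]

categories = ["elementary", "vocational", "secondary", "tertiary"]
education = ["elementary", "vocational", "secondary", "tertiary"]

coded = dummy_code(education, categories, reference="elementary")
for level, row in zip(education, coded):
    print(f"{level:<11}", row)
```

A person in the reference category receives 0 in every dummy column, so each estimated coefficient expresses the difference between its category and the reference category.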
105. we want to check what the optimal cut-off is above which we can consider the study to be positive (detecting bacteremia). In order to check whether PCT is really useful for diagnosing bacteremia, we will calculate the size of the area under the ROC curve and verify the hypotheses $H_0$: area under the constructed ROC curve = 0.5, $H_1$: area under the constructed ROC curve $\ne$ 0.5. As bacteremia is accompanied by an increased PCT level, in the test options window we will consider the indicator to be a stimulant. In the state variable we have to define which value in the bacteremia column determines its presence, so we select yes. Apart from the result of the statistical test, in the report we can find an exact description of every possible cut-off. Report: Analysed variables: PCT, bacteremia; Count of missing data: 3; Size: 133; Size (STATE: yes): 33; Size (STATE: no): 100; Direction of diagnostic variable: stimulant; Prevalence: 0.248120301 (95% CI: 0.1773568632 to 0.330437258); DeLong's method: AUC 0.889242424, SE(AUC) 0.048124092, 95% CI: 0.794920921 to 0.9835639; Z statistic: 6.691374693; p-value: 0.000000001. [Table describing every possible cut-off; values garbled in extraction.]
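The figures legible in the report of extract 105 can be cross-checked. In the sketch below the AUC is taken as 0.889242424, the midpoint of the reported 95% confidence interval (the AUC field itself is damaged in the extract). The standard error under the null hypothesis is assumed to be $\sqrt{(n_1 + n_2 + 1) / (12 n_1 n_2)}$; that formula is not given in the extract but happens to reproduce the reported Z statistic.

```python
import math

n_state_yes, n_state_no = 33, 100   # group sizes from the report
auc = 0.889242424                   # midpoint of the reported 95% CI (assumed AUC)
se_auc = 0.048124092                # SE(AUC) reported for DeLong's method
z_975 = 1.959964                    # 97.5th percentile of the standard normal

# Confidence interval based on the reported standard error.
ci_lower = auc - z_975 * se_auc
ci_upper = auc + z_975 * se_auc

# Test of H0: AUC = 0.5, using the assumed standard error under H0.
se_null = math.sqrt((n_state_yes + n_state_no + 1) / (12 * n_state_yes * n_state_no))
z_stat = (auc - 0.5) / se_null

prevalence = n_state_yes / (n_state_yes + n_state_no)

print(f"95% CI = ({ci_lower:.6f}, {ci_upper:.6f})")
print(f"Z = {z_stat:.6f}, prevalence = {prevalence:.9f}")
```

The interval, the Z statistic and the prevalence agree with the report to the printed precision, which supports reading the damaged AUC field as 0.889242424.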
106. 0; 2. the distance between the objects equals 0 if and only if the objects are identical: $d(x_1, x_2) = 0 \iff x_1 = x_2$; 3. the distance must be symmetrical, i.e. the distance from the object $x_1$ to $x_2$ must be the same as from the object $x_2$ to $x_1$: $d(x, y) = d(y, x)$; 4. the distance must fulfil the triangle inequality: $d(x, z) \le d(x, y) + d(y, z)$. Note: The metrics ought to be calculated for characteristics with the same range of values. Otherwise, the characteristics with higher ranges would have a greater influence on the obtained similarity result than those with lower ones. For example, when calculating the similarity of people we can base the calculation on such features as weight or age. Then the weight in kilograms, in the range from 40 to 150 kg, will have a greater influence on the result than age, in the range of 18 to 90 years. For the influence of all characteristics on the obtained similarity result to be balanced, we ought to normalize or standardize each of them before commencing the analysis. If we want to decide on the degree of that influence by ourselves, we should enter our own weights, selecting the type of the metric after the standardization. Distance. Metric: Euclidean. When we talk about distance without defining its type, we assume that it is the Euclidean distance, the most popular type of distance, constituting a natural element of models of the real world. The Euclidean distance is a metric described by
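The effect of standardization on the Euclidean distance, discussed in extract 106 for weight and age, can be shown with a small sketch. The three persons are hypothetical, and standardizing with the population standard deviation is an assumption; the extract does not say which normalization the program applies.

```python
import math
from statistics import mean, pstdev

def standardize(column):
    """Rescale a column to mean 0 and standard deviation 1."""
    m, s = mean(column), pstdev(column)
    return [(v - m) / s for v in column]

def euclidean(p, q):
    """Euclidean distance between two points given as equal-length sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# Hypothetical persons described by weight (kg) and age (years).
weights = [40.0, 95.0, 150.0]
ages = [18.0, 54.0, 90.0]

raw_points = list(zip(weights, ages))
std_points = list(zip(standardize(weights), standardize(ages)))

raw_distance = euclidean(raw_points[0], raw_points[1])
std_distance = euclidean(std_points[0], std_points[1])
print(f"raw distance = {raw_distance:.3f}, standardized distance = {std_distance:.3f}")
```

On the raw scale the weight difference (55 kg) dominates the age difference (36 years); after standardization both features contribute equally to the distance.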
107. 1 HOW TO ORGANISE DATA. The way of organising data depends on the statistical procedures that a user wants to follow. Statistical analysis of data may be done on the basis of data gathered in a contingency table or as raw data. It is also possible to convert data: from a contingency table into a raw form (you can do this by selecting Create raw data from the Data menu) and from a raw form into a contingency table (you can do this by selecting Create table from the Data menu). 1. Data in the form of raw records are data organised so that each row includes information about one studied object, such as a patient, a firm etc. EXAMPLE 4.1 Raw data (sex education.pqs file). [Screenshot: PQStat window with the EN_sex education.pqs datasheet (menus File, Edit, Data, Statistics, Spatial analysis, Help) showing raw data and a contingency table. Legible raw records: 2 male, primary vocational; 3 male, higher; 4 male, secondary; 5 male, higher; 6 female, primary vocational; 7 male, primary vocational; 8 female, higher; 9 female, higher; 10 female, secondary; 11 male, primary vocational; 12 male, secondary; 13 female, higher; 14 male, primary vocational.] 2. The contingency table presents a joint distribution of 2 variables. There are observed frequencies (natural numbers) inside the table. EXAMPLE 4.2 A contingency table (sex education.pqs file).
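The conversion from raw records to a contingency table described in extract 107 amounts to counting each combination of values. The sketch below uses rows 2 to 14 as far as they are legible in the garbled screenshot text (the damaged education label is read as "primary vocational", which is an assumption); it illustrates the idea and is not the program's Create table procedure.

```python
from collections import Counter

# Raw records (sex, education), rows 2-14 as read from the screenshot text.
raw_records = [
    ("male", "primary vocational"), ("male", "higher"), ("male", "secondary"),
    ("male", "higher"), ("female", "primary vocational"),
    ("male", "primary vocational"), ("female", "higher"), ("female", "higher"),
    ("female", "secondary"), ("male", "primary vocational"),
    ("male", "secondary"), ("female", "higher"), ("male", "primary vocational"),
]

# Contingency table: observed frequency of every (sex, education) pair.
table = Counter(raw_records)

sexes = sorted({sex for sex, _ in raw_records})
levels = sorted({edu for _, edu in raw_records})

print(" " * 8 + "".join(f"{level:>20}" for level in levels))
for sex in sexes:
    print(f"{sex:<8}" + "".join(f"{table[(sex, level)]:>20}" for level in levels))
```

The reverse conversion expands every cell of the table into as many identical raw records as its observed frequency.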
108. [Table fragment with the values of the principal components; garbled in extraction.] In order to be able to use the two initial components instead of the previous four original values, we copy and paste them into the data sheet. Now the researcher can conduct further statistical analyses on two new, uncorrelated variables. Analysis of the graphs of the two initial components: The analysis of the graphs not only leads the researcher to the same conclusions as the analysis of the tables but also gives him or her the opportunity to evaluate the results more closely. Factor loadings graph: [Figure: factor loadings graph; axis label Factor (72.96%).] The graph shows the two first principal components, which represent 72.96% and 22.85% of the variance, together amounting to 95.81% of the variance of the original values. The vectors representing the original values almost reach the rim of the unit circle (a circle with a radius of 1), which means they are all well represented by the two initial principal components, which form the coordinate system. The angle between the vectors illustrating the length of the petal, the width of the petal and the length of the sepal is small, which means those var
109. If $4 - d > d_U$, the errors are not negatively correlated. If $d_L \le 4 - d \le d_U$, the test result is ambiguous. The critical values of the Durbin-Watson test for the significance level $\alpha = 0.05$ are on the website www.pqstat.com (source: the Savin and White tables (1977) [74]). 17.2.4 Prediction on the basis of the model. Most often, the last stage of regression analysis is the use of the constructed and verified model for prediction. Predicting the value of the dependent variable is possible for the studied values of the independent variables. The computed value is estimated with a certain error. That is why limits resulting from that error are additionally estimated for the estimated value: for the expected value, confidence limits are estimated; for a single point, prediction limits are estimated. EXAMPLE 17.1 (publisher.pqs file). A certain book publisher wanted to learn how gross profit from sales was influenced by such variables as production cost, advertising costs, direct promotion cost, the sum of discounts made and the author's popularity. For that purpose he analyzed 40 titles published during the previous year. A part of the data is presented in the image below. [Image: a part of the data sheet; values garbled in extraction.]
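The decision rule for negative autocorrelation quoted in extract 109 can be sketched together with the usual definition of the Durbin-Watson statistic, $d = \sum_{t=2}^{n} (e_t - e_{t-1})^2 / \sum_{t=1}^{n} e_t^2$. That definition and the third branch of the rule (errors negatively correlated when $4 - d < d_L$) are standard textbook statements assumed here, since the extract shows only two branches. The critical values in the example are hypothetical placeholders; real ones come from the Savin and White tables.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic of a sequence of residuals in time order."""
    numerator = sum((residuals[t] - residuals[t - 1]) ** 2
                    for t in range(1, len(residuals)))
    denominator = sum(e * e for e in residuals)
    return numerator / denominator

def negative_autocorrelation_decision(d, d_lower, d_upper):
    """Apply the rule for negative autocorrelation to the value 4 - d."""
    value = 4.0 - d
    if value > d_upper:
        return "errors are not negatively correlated"
    if d_lower <= value <= d_upper:
        return "test result is ambiguous"
    return "errors are negatively correlated"  # assumed third branch

# Hypothetical residuals that alternate in sign.
residuals = [1.0, -1.0, 1.0, -1.0]
d = durbin_watson(residuals)

# Hypothetical critical values, for illustration only.
d_lower, d_upper = 1.2, 1.5
print(f"d = {d:.2f}:", negative_autocorrelation_decision(d, d_lower, d_upper))
```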
110. 2 (dinners.pqs file). We would like to know whether the number of dinners served in a school canteen within a given time frame (from Monday to Friday) is statistically the same. To do this, a one-week sample was taken and the number of dinners served on each day was recorded: Monday 33, Tuesday 29, Wednesday 32, Thursday 36, Friday 20. As a result, 150 dinners were served in this canteen within a week (5 days). We assume that the probability of serving a dinner is exactly the same each day, so it comes to $\frac{1}{5}$. The expected frequency of served dinners for each of the 5 days of the week comes to $150 \cdot \frac{1}{5} = 30$. Observed and expected frequencies: Monday 33 and 30, Tuesday 29 and 30, Wednesday 32 and 30, Thursday 36 and 30, Friday 20 and 30. Hypotheses: $H_0$: the number of dinners served in the analysed school canteen on the given days of the week is consistent with the expected number of dinners given out on these days; $H_1$: the number of dinners served in the analysed school canteen within a given week is not consistent with the expected number of dinners given out on these days. Report: Analysed variables: number of served dinners, expected number of se...; Significance level: 0.05; Chi-square statistic and Degrees of freedom (values not preserved); p-value: 0.287297. Chart of observed and expected frequencies by day: Monday, Tuesday, Wednes
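The reported p-value of the dinners example in extract 110 can be reproduced from the observed counts. The sketch below computes the goodness-of-fit statistic $\chi^2 = \sum (O - E)^2 / E$ and uses a closed-form survival function that is valid only for an even number of degrees of freedom (here df = 4); the helper name `chi2_sf_even_df` is hypothetical.

```python
import math

observed = {"Monday": 33, "Tuesday": 29, "Wednesday": 32, "Thursday": 36, "Friday": 20}

total = sum(observed.values())        # 150 dinners
expected = total / len(observed)      # 30 dinners per day under H0

chi2 = sum((o - expected) ** 2 / expected for o in observed.values())
df = len(observed) - 1

def chi2_sf_even_df(x, df):
    """P(X > x) for a chi-square variable with an even number of degrees of freedom."""
    if df % 2 != 0:
        raise ValueError("closed form used here requires even df")
    half = x / 2.0
    return math.exp(-half) * sum(half ** k / math.factorial(k) for k in range(df // 2))

p_value = chi2_sf_even_df(chi2, df)
print(f"chi-square = {chi2:.4f}, df = {df}, p = {p_value:.6f}")
```

Because p exceeds the significance level of 0.05, there is no reason to reject the hypothesis that dinners are served equally often on each day.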
111. [Settings window residue: list of normalized variables (among them Norm. Distance of the district and Norm. Proximity of a bus or tra...), metric options including Jaccard (Tanimoto), Modification for Euclidean, only from selected rows (no selection), Replace empty cells.] To build the second matrix, we choose in the matrix window the same settings as in the case of the first matrix, with the difference that now we additionally select the Modification Euclidean button and enter greater weights for the Number of rooms and the Distance of the district center in the modification window. For example, their values could be equal to 10, and for the remaining characteristics the values could be smaller, e.g. equal to 1. [Window: Metric modification for Euclidean, with the options Inverse, Weight and Square and a weight entered for each normalized variable.] As a result we will obtain two matrices. In each of them the first column concerns the similarity to the flat looked for by the client. [Matrices: Euclidean and Weighted Euclidean, each with the column Wanted and the rows Flat 10, Flat 12, Flat 17, Flat 35, Flat 88, Flat 101, Flat 105, Flat 122, Flat 130, Flat 132, Flat 135; values not preserved.] According to the u
112. Chi-square ($\chi^2$) distribution: this is a right-skewed distribution with a shape depending on the number of degrees of freedom $df$. The higher the number of degrees of freedom, the more similar the shape of the $\chi^2$ distribution to the normal distribution. The density function is defined by $f(x, df) = \frac{1}{2^{df/2} \Gamma(df/2)} x^{df/2 - 1} e^{-x/2}$, where $x > 0$, $df$ is the number of degrees of freedom (the sample size decreased by the number of limitations in the given calculations) and $\Gamma$ is the Gamma function. Fisher-Snedecor distribution: this is a distribution which has a longer right tail and a shape that depends on the numbers of degrees of freedom $df_1$ and $df_2$. The density function is defined by $F(x, df_1, df_2) = \frac{\sqrt{\frac{(df_1 x)^{df_1} df_2^{df_2}}{(df_1 x + df_2)^{df_1 + df_2}}}}{x B\left( \frac{df_1}{2}, \frac{df_2}{2} \right)}$, where $x > 0$, $df_1$, $df_2$ are the degrees of freedom (it is assumed that if $X$ and $Y$ are independent with $\chi^2$ distributions with $df_1$ and $df_2$ degrees of freedom respectively, then $F = \frac{X / df_1}{Y / df_2}$ has the F Snedecor distribution $F(df_1, df_2)$) and $B$ is the Beta function. [Figure: density curves of the F distribution for several pairs of degrees of freedom, among them $F(df_1 = 12, df_2 = 3)$ and $F(df_1 = 3, df_2 = 12)$.]
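The two density functions of extract 112 can be evaluated directly with the Gamma function from the standard library. The sketch assumes the standard forms of the chi-square and Fisher-Snedecor densities; the function names are hypothetical.

```python
import math

def chi2_pdf(x, df):
    """Density of the chi-square distribution with df degrees of freedom, x > 0."""
    return (x ** (df / 2.0 - 1.0) * math.exp(-x / 2.0)
            / (2.0 ** (df / 2.0) * math.gamma(df / 2.0)))

def beta_function(a, b):
    """Beta function expressed through the Gamma function."""
    return math.gamma(a) * math.gamma(b) / math.gamma(a + b)

def f_pdf(x, df1, df2):
    """Density of the Fisher-Snedecor distribution F(df1, df2), x > 0."""
    numerator = math.sqrt((df1 * x) ** df1 * df2 ** df2
                          / (df1 * x + df2) ** (df1 + df2))
    return numerator / (x * beta_function(df1 / 2.0, df2 / 2.0))

print(f"chi-square density at x = 2, df = 2:   {chi2_pdf(2.0, 2):.6f}")
print(f"F density at x = 1, df1 = 2, df2 = 2:  {f_pdf(1.0, 2, 2):.6f}")
```

Two special cases give a quick check: for df = 2 the chi-square density reduces to $\frac{1}{2} e^{-x/2}$, and for $df_1 = df_2 = 2$ the F density reduces to $1 / (1 + x)^2$.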
113. 41
4.1 HOW TO ORGANISE DATA 41
4.2 HOW TO REDUCE A DATASHEET WORKSPACE 43
4.3 MULTIPLE REPEATED ANALYSIS 47
4.4 INFORMATION GIVEN IN A REPORT 47
4.5 MARKING OF STATISTICALLY SIGNIFICANT RESULTS 47
5 GRAPHS 48
5.1 GRAPHS GALLERY 48
5.1.1 [title illegible] 48
5.1.2 Error plots 53
5.1.3 Box-Whiskers plots 55
5.1.4 Scatter plots 56
5.1.5 Line plots 58
6 FREQUENCY TABLES AND EMPIRICAL DATA DISTRIBUTION 60
7 DESCRIPTIVE STATISTICS 65
7.1 MEASUREMENT SCALES 65
7.2 MEASURES OF POSITION (LOCATION) 67
7.2.1 CENTRAL TENDENCY MEASURES 67
7.2.2 ANOTHER MEASURES OF POSITION 68
7.3 MEASURES OF VARIABILITY (DISPERSION) 69
7.4 ANOTHER DISTRIBUTION CHARACTERISTICS 70
8 PROBABILITY DISTRIBUTIONS 73
8.1 CONTINUOUS PROBABILITY DISTRIBUTIONS 75
8.2 PROBABILITY DISTRIBUTION CALCULATOR 78
9 HYPOTHESES TESTING 81
9.0.1 POINT AND INTERVAL ESTIMATION 81
9.0.2 VERIFICATION OF STATISTICAL HYPOTHESES 81
10 COMPARISON 1 GROUP 84
10.1 PARAMET
114. [Report table fragment; column assignment partly garbled in extraction. Legible rows: tolerance 0.639209, 0.637949, 0.948099, 0.987951; $R^2$ 0.360791, 0.362051, 0.051901, 0.012049; t statistics 3.106033, 5.965106, 0.974245, 1.013069; for popular_author: t statistic 3.649047, p-value 0.000874.] The values of the coefficients of partial and semipartial correlation indicate that the smallest contribution to the constructed model is that of direct promotion costs and the sum of discounts made. However, those variables are the least correlated with model residuals, which is indicated by the low $R^2$ value and the high tolerance value. All in all, from the statistical point of view, models without those variables would not be worse than the current model (see the result of the test for model comparison). The decision about whether to leave that model or to construct a new one without the direct promotion costs and the sum of discounts made belongs to the researcher. We will leave the current model. Finally, we will analyze the residuals. A part of that analysis is presented below. [Table of predicted values; values garbled in extraction.]
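The tolerance and $R^2$ rows legible in extract 114 are linked by tolerance $= 1 - R^2$. The sketch below checks this for the four reported pairs and adds the variance inflation factor, VIF = 1 / tolerance, a standard companion measure that is not mentioned in the extract. The labels X1 to X4 are placeholders, because the assignment of columns to variables is not legible.

```python
# R-squared and tolerance as printed in the report, in column order.
r_squared = {"X1": 0.360791, "X2": 0.362051, "X3": 0.051901, "X4": 0.012049}
reported_tolerance = {"X1": 0.639209, "X2": 0.637949, "X3": 0.948099, "X4": 0.987951}

tolerance = {name: 1.0 - r2 for name, r2 in r_squared.items()}
vif = {name: 1.0 / tol for name, tol in tolerance.items()}  # not part of the report

for name in r_squared:
    print(f"{name}: R2 = {r_squared[name]:.6f}, "
          f"tolerance = {tolerance[name]:.6f}, VIF = {vif[name]:.3f}")
```

A tolerance close to 1 (low $R^2$) means that the variable is almost unrelated to the other independent variables, so multicollinearity is not a concern for it.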
115. [End of the data sheet image.] The first five variables are expressed in thousands of dollars, so they are variables gathered on an interval scale. The last variable, the author's popularity, is a dichotomous variable, where 1 stands for a known author and 0 stands for an unknown author. On the basis of the knowledge gained from the analysis, the publisher wants to predict the gross profit from the next published book, written by a known author. The expenses the publisher will bear are: production cost 11, advertising costs 13, direct promotion costs 0.5, the sum of discounts made 0.5. We construct the model of multiple linear regression, selecting gross profit as the dependent variable Y and production cost, advertising costs, direct promotion costs, the sum of discounts made and the author's popularity as the independent variables X1, X2, X3, X4, X5. As a result, the coefficients of the regression equation will be estimated, together with measures which will allow the evaluation of the quality of the model. Report: Analysed variables: gross_profit, prod_c, advert_c, ...; Significance level: 0.05; Number of estimated parameters: 6; R: 0.922483; $R^2$: 0.850974; Adjusted $R^2$: 0.829059; Standard error of estimation: 8.086501; Residual sum of squares: 2223.351121; Total sum of squares: 14918.99775; Explained sum of squares: 12695.686654; 38.8297
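The quality measures in the report of extract 115 follow from the sums of squares. The sketch below recomputes them under the usual definitions, taking n = 40 (the number of analysed titles stated in the example) and 6 estimated parameters; small differences in the last digits are to be expected because the printed values are rounded or slightly damaged in extraction.

```python
import math

n = 40                      # analysed titles (from the example description)
n_parameters = 6            # number of estimated parameters (from the report)
ss_residual = 2223.351121   # residual sum of squares
ss_total = 14918.99775      # total sum of squares

r_squared = 1.0 - ss_residual / ss_total
multiple_r = math.sqrt(r_squared)
adjusted_r_squared = 1.0 - (1.0 - r_squared) * (n - 1) / (n - n_parameters)
standard_error = math.sqrt(ss_residual / (n - n_parameters))

print(f"R = {multiple_r:.6f}")
print(f"R2 = {r_squared:.6f}")
print(f"adjusted R2 = {adjusted_r_squared:.6f}")
print(f"standard error of estimation = {standard_error:.6f}")
```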
116. [Report tables of two logistic regression models; the coefficient, error and confidence-interval columns are garbled in extraction. Legible columns follow.] Full model: Wald statistics 14.948697, 1.012102, 1.015501, 10.200921, 3.599199, 10.485392, 16.402912; p-values 0.00011, 0.3144, 0.313589, 0.001404, 0.059359, 0.001204, 0.000051; Odds Ratios 1381.0525, 0.635564, 0.634582, 0.904027, 1.577637, 0.914484, 0.146022. Reduced model: Wald statistics 13.857511, 11.317897, 4.511625, 9.741995, 15.99675; p-values 0.000197, 0.000765, 0.033665, 0.001801, 0.000065; Odds Ratios 661.91991, 0.699114, 1.655426, 0.919644, 0.157594. Model comparison: degrees of freedom 2, p-value 0.305129. The result of the Likelihood Ratio test (p = 0.3051) indicates that there is no basis for believing that the full model is better than the reduced one. Therefore, with a slight worsening of model adequacy, the address of residence and the sex can be omitted. Note: The comparison of both models with respect to their ability to classify can be made by comparing the ROC curves for those models. For that purpose we u
117. [Report table of a logistic regression model; the first columns are garbled in extraction. Legible columns follow.] p-values: 0.016325, 0.787754, 0.010409, 0.024485, 0.070256, 0.003014, 0.109081, 0.000001, 0.000436, 0.047652, 0.008798, 0.028015; odds ratios: 1.206283, 0.954492, 1.549783, 0.743835, 0.96648, 1.349394, 0.611555, 4.427393, 4.29709, 0.505459, 0.416308, 0.455477; lower 95% CI limits: 1.356231, 0.680014, 1.108427, 0.574768, 0.931454, 1.10701, 0.335124, 2.069235, 1.906976, 0.257142, 0.217934, 0.223963; upper 95% CI limits: 20.574401, 1.339761, 2.166879, 0.962633, 1.002824, 1.644848, 1.116004, 7.629454, 9.662649, 0.993566, 0.802912, 0.916194 (values as printed; some digits may be damaged). Report: Number of estimated parameters, Frequency (0: control), Frequency (1: case) (values not preserved). Likelihood ratio test: Log Likelihood -416.06069; -2 Log Likelihood 832.12138; Log Likelihood (intercept) -469.809235; -2 Log Likelihood (intercept) 939.618471; Chi-square statistic 107.49709; Degrees of freedom 11; p-value < 0.000001; Pseudo $R^2$ 0.114405; $R^2$ (Nagelkerke) 0.195521; $R^2$ (Cox-Snell) 0.14662. Hosmer-Lemeshow test: Chi-square statistic 6.720908; Degrees of freedom 8; p-value 0.567022. As a result, the variables which describe education become statistically significant. The goodness of fit of the model does not change much, but the manner of interpretation of the odds ratio for education
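Several figures in the report of extract 117 can be recomputed from the two printed -2 Log Likelihood values. The sketch assumes the usual definitions: the likelihood ratio statistic is the difference of the two -2 Log Likelihood values, and the Pseudo $R^2$ is one minus their ratio (McFadden's form). The p-value of the Hosmer-Lemeshow test is checked with a closed-form chi-square survival function that works only for even degrees of freedom; the helper name is hypothetical.

```python
import math

minus2ll_model = 832.12138        # -2 Log Likelihood of the fitted model
minus2ll_intercept = 939.618471   # -2 Log Likelihood of the intercept-only model

lr_chi2 = minus2ll_intercept - minus2ll_model
pseudo_r2 = 1.0 - minus2ll_model / minus2ll_intercept

def chi2_sf_even_df(x, df):
    """P(X > x) for a chi-square variable with an even number of degrees of freedom."""
    if df % 2 != 0:
        raise ValueError("closed form used here requires even df")
    half = x / 2.0
    return math.exp(-half) * sum(half ** k / math.factorial(k) for k in range(df // 2))

hosmer_lemeshow_p = chi2_sf_even_df(6.720908, 8)

print(f"LR chi-square = {lr_chi2:.5f}")
print(f"Pseudo R2 = {pseudo_r2:.6f}")
print(f"Hosmer-Lemeshow p = {hosmer_lemeshow_p:.6f}")
```

A Hosmer-Lemeshow p-value well above 0.05 gives no reason to doubt the fit of the model, while the very large likelihood ratio statistic shows that the model is better than the intercept-only model.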
118. [Report table with the coefficients b, their errors, 95% confidence intervals, t statistics, p-values and standardized coefficients; values garbled in extraction. Legible b coefficients: 4.175166, 2.560709, 1.998235, 4.666238, 1.423171 and, for popular_author, 10.153717 (t statistic 3.649047, p-value 0.000874).] On the basis of the estimated values of the coefficients b, the relationship between gross profit and all independent variables can be described by means of the equation $profit_{gross} = 4.18 + 2.56 c_{prod} + 2 c_{adv} + 4.67 c_{prom} + 1.42 discounts + 10.15 popul_{author} [\pm 8.09]$. The obtained coefficients are interpreted in the following manner: if the production cost increases by 1 thousand dollars, then gross profit will increase by about 2.56 thousand dollars, assuming that the remaining variables do not change; if the advertising costs increase by 1 thousand dollars, then gross profit will increase by about 2 thousand dollars, assuming that the remaining variables do not change; if the direct promotion costs increase by 1 thousand dollars, then gross profit will increase by about 4.67 thousand dollars, assuming that the remai
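The regression equation of extract 118 can be used directly for a point prediction. The sketch below plugs in the planned expenses given in the description of Example 17.1 (production cost 11, advertising costs 13, direct promotion costs 0.5, sum of discounts 0.5, known author) and uses the rounded coefficients from the equation, so the result is approximate; the dictionary keys are hypothetical labels.

```python
# Rounded coefficients from the regression equation (thousands of dollars).
intercept = 4.18
coefficients = {
    "production_cost": 2.56,
    "advertising_costs": 2.0,
    "direct_promotion_costs": 4.67,
    "sum_of_discounts": 1.42,
    "popular_author": 10.15,
}

# Planned values for the next book by a known author.
planned = {
    "production_cost": 11.0,
    "advertising_costs": 13.0,
    "direct_promotion_costs": 0.5,
    "sum_of_discounts": 0.5,
    "popular_author": 1.0,
}

predicted_profit = intercept + sum(coefficients[k] * planned[k] for k in coefficients)
print(f"predicted gross profit: about {predicted_profit:.2f} thousand dollars")
```

This is only the point estimate; the confidence and prediction limits reported by the program additionally account for the estimation error.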
119. [Table of predicted values, residuals and standardized residuals, with markers from below -3sd to above 3sd; most values are garbled in extraction. For observation 16 the table shows a residual of 36.46786 and a standardized residual of 4.509721.] It is noticeable that one of the model residuals is an outlier: it deviates by more than 3 standard deviations from the mean value. It is observation number 16. The observation can be easily found by drawing a chart of residuals with respect to the observed or expected values of the variable Y. [Figures: residuals against predicted values; residuals against observed values.] That outlier undermines the assumption concerning homoscedasticity. The assumption of homoscedasticity would be confirmed (that is, the residual variance presented on the Y axis would be similar as we move along the X axis) if we rejected that point. Additionally, the distribution of residuals deviates slightly from the normal distribution (the value
120. [Report tables for the Cox PH regression models with the variables Rx and log WBC and their interaction; values garbled in extraction.] The variable which informs about the interaction of Rx and log WBC, included in model C, is not significant in model C according to the Wald test. Thus we can view further consideration of the interaction of the two variables in the model as unnecessary. We will obtain similar results by comparing model C with model B by means of a likelihood ratio test. We can make the comparison by choosing the Cox PH regression, comparing models menu. We will then obtain a non-significant result (p = 0.5134), which means that model C (the model with interaction) is NOT significantly better than model B (the model without interaction). Chi-square (models comparison): 0.427079874; Degrees of freedom: 1; p-value: 0.513425299. Therefore we reject model C and move on to consider model B and model A. The HR for Rx in model B is 3.65, which means that the hazard for the placebo group is about 3.6 times greater than for the patients undergoing treatment. Model A only contains the Rx variable, which is why it is usually called a crude model: it ignores the effect of potential confounding factors. In that model the HR for Rx is 4.52 and is much greater than in model B. However, let us look not only at the point val
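The p-value of the model comparison in extract 120 can be verified from the reported chi-square statistic. For 1 degree of freedom the chi-square survival function equals the two-sided tail of the standard normal distribution at $\sqrt{\chi^2}$, which the standard library can evaluate with `math.erfc`; the helper name is hypothetical.

```python
import math

def chi2_sf_df1(x):
    """P(X > x) for a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))

lr_chi2 = 0.427079874   # Chi-square (models comparison) from the report
p_value = chi2_sf_df1(lr_chi2)

alpha = 0.05
print(f"p = {p_value:.6f}")
print("model with interaction is significantly better" if p_value <= alpha
      else "no significant improvement from the interaction term")
```

The result agrees with the value p = 0.5134 quoted in the text, so the interaction of Rx and log WBC does not improve the model.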
121. [Figure residue from section 5 GRAPHS: error plot with the vertical axis Group proportion.] 5.1.3 Box-Whiskers plots. [Figures: box-whiskers plot for the phases of the campaign (before, during, after); box-whiskers plot with a legend showing Median, Min and Max.] 5.1.4 Scatter plots. [Figures: scatter plots with fitted lines, among them one with the horizontal axis height and one of Fuel consumption against Mass (ton) with power and linear fits; Bland-Altman plot.] 5.1.5 Line plots. [Figure: line plot of ranks given by judges to dance couples.] 6 FREQUENCY TABLES AND EMPIRICAL DATA DISTRIBUTION. The basis of all statistical analyses is to define an empirical
122. 910 Correlation calculated from faulty data. British Journal of Psychology 3, 271-295
[76] Tarone R.E., Ware J. (1977), On distribution-free tests for equality of survival distributions. Biometrika 64(1), 156-160
[77] Tarone R.E. (1985), On heterogeneity tests based on efficient scores. Biometrika 72, 91-95
[78] Volinsky C.T., Raftery A.E. (2000), Bayesian information criterion for censored survival models. Biometrics 56(1), 256-262
[79] Wallenstein S. (1997), A non-iterative accurate asymptotic confidence interval for the difference between two proportions. Statistics in Medicine 16, 1329-1336
[80] Wallis W.A. (1939), The correlation ratio for ranked data. Journal of the American Statistical Association 34, 533-538
[81] Wilcoxon F. (1945), Individual comparisons by ranking methods. Biometrics 1, 80-83
[83] Wilcoxon F. (1949), Some rapid approximate statistical procedures. Stamford, CT: Stamford Research Laboratories, American Cyanamid Corporation
123. 95 the higher the birth weight, the smaller the odds of the occurrence of the anomaly in a child;
• variable PregNo, OR[95% CI] = 1.34 [1.10; 1.63]: the odds of the occurrence of the anomaly in a child are 1.34 times greater with each subsequent pregnancy;
• variable RespTiInf, OR[95% CI] = 4.46 [2.59; 7.69]: the odds of the occurrence of the anomaly in a child whose mother had a respiratory tract infection during the pregnancy are 4.46 times greater than for a mother who did not have such an infection during the pregnancy;
• variable Smoking, OR[95% CI] = 4.44 [1.98; 9.96]: a mother who smokes when pregnant increases the odds of the occurrence of the anomaly in her child 4.44 times.
In the case of statistically insignificant variables the confidence interval for the Odds Ratio contains 1, which means that the variables neither increase nor decrease the odds of the occurrence of the studied anomaly. Therefore, we cannot interpret the obtained ratio in a manner similar to the case of statistically significant variables. The influence of particular independent variables on the occurrence of the anomaly can also be described with the help of a chart concerning the odds ratio. [Figure: odds ratio chart for the independent variables of the model] 17 MULTIDIMENSIONAL MODELS Note: An independent variable with a few categories can be considered in t
124. 958 The Spearman-Brown split-half reliability coefficient is 0.857705. The Guttman split-half reliability coefficient is 0.856531. The halves are well correlated: the correlation coefficient is 0.750862. However, the value of the Cronbach alpha coefficient is too low for the second half (0.416958). This half includes the KK7 item, which shows a weak correlation with the other scale items. After removing the item and repeating the analysis, all the coefficients are high and indicate a reliable scale.
20 RELIABILITY ANALYSIS
Analysed variables: KK1, KK3, KK5, KK2, KK4, KK6
Significance level: 0.05
Group size: 24
Mean of scale: 23.6066667
Standard deviation of scale: 5.73067
Correlation between two halves of scale: 0.822933
Split-half reliability: 0.902867
Standard error of measurement: 1.766032
Guttman split-half reliability: 0.902251
First half. Number of items: 3. Names of items: KK1, KK3, KK5. Mean: 11.625. Standard deviation: 3.076029. Cronbach Alpha: 0.607122
Second half. Number of items: 3. Names of items: KK2, KK4, KK6. Mean: 12.041667. Standard deviation: 2.92633. Cronbach Alpha: 0.566416
21 THE WIZARD The Wizard is a tool which makes it easier, especially for a novice user, to navigate through the basic statistics included in the application. It includes sugge
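The split-half reliabilities reported above follow from the correlation between the two halves through the Spearman-Brown formula $r_{SH} = 2r / (1 + r)$. The sketch below (the function name `spearman_brown` is hypothetical, not part of PQStat) applies it to the two correlations quoted in the reports.

```python
def spearman_brown(half_correlation: float) -> float:
    """Spearman-Brown split-half reliability from the correlation of two halves."""
    return 2.0 * half_correlation / (1.0 + half_correlation)


# Correlations between the halves taken from the two reports above.
with_kk7 = spearman_brown(0.750862)     # scale including the KK7 item
without_kk7 = spearman_brown(0.822933)  # scale after removing KK7

print(f"{with_kk7:.6f} {without_kk7:.6f}")
```

Both results agree with the reported coefficients, which makes the formula a quick check on a report.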
125. 981102 1.6999158
Hazard ratios with 95% confidence limits (comparison: hazard ratio, lower limit, upper limit):
hospital 2, 1 <> 3: 0.3203277, 0.0720764, 1.4236263
hospital 2, 2 <> 3: 0.3893849, 0.0902500, 1.6800069
Common, 1 <> 2: 0.7337077, 0.3810310, 1.4128167
Common, 1 <> 3: 0.5257480, 0.2631977, 1.0502028
Common: 0.7165633, 0.3755802, 1.3671193
19 SURVIVAL ANALYSIS we can localize significant differences. For the comparison of the curve of the youngest group with the curve of the oldest group the hazard ratio is the smallest, 0.53. The 95% confidence interval for that ratio, [0.26; 1.05], does contain the value 1 but is on the verge of that value, which can suggest that there are significant differences between the respective curves. In order to confirm that supposition an inquisitive researcher can, with the use of the data filter in the analysis window, compare the curves in pairs. However, it ought to be remembered that one of the corrections for multiple comparisons should be used and the significance level should be modified. In this case, for Bonferroni's correction with three comparisons, the significance level will be 0.017. For simplicity, we will only avail ourselves of the log-rank test.
[45 years; 50 years) vs [50 years; 55 years). Common for strata: Chi-square statistic, Degrees of freedom, p-value
[45 years; 50 years) vs [55 years; 60 years). Common for strata: Chi-square statistic, Degrees of freedom, p-value
[50 years; 55 years) vs [55 years; 60 years). Common for strata
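The corrected significance level of 0.017 mentioned above comes from dividing the nominal level by the number of pairwise comparisons. A minimal sketch follows; the nominal level of 0.05 is an assumption (the usual default), and the function name `bonferroni_level` is hypothetical.

```python
def bonferroni_level(alpha: float, comparisons: int) -> float:
    """Significance level per comparison under Bonferroni's correction."""
    if comparisons < 1:
        raise ValueError("at least one comparison is required")
    return alpha / comparisons


# Assumed nominal level 0.05 and the three pairwise curve comparisons.
corrected = bonferroni_level(0.05, 3)
print(round(corrected, 3))
```

Each pairwise log-rank p-value is then compared with this corrected level instead of 0.05.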
126. 994 Statistical Methods in Medical Research (3rd edition). Blackwell
[6] Barnard G.A. (1989), On alleged gains in power from lower p-values. Statistics in Medicine 8, 1469-1477
[7] Beal S.L. (1987), Asymptotic confidence intervals for the difference between two binomial parameters for use with small samples. Biometrics 43, 941-950
[8] Bender R. (2001), Calculating confidence intervals for the number needed to treat. Controlled Clinical Trials 22, 102-110
[9] Betty R. Kirkwood and Jonathan A. C. Sterne (2003), Medical Statistics (2nd ed.). Massachusetts: Blackwell Science, 177-188, 240-248
[10] Bland J.M., Altman D.G. (1986), Statistical methods for assessing agreement between two methods of clinical measurement. Lancet 327 (8476), 307-10
[11] Bowker A.H. (1948), Test for symmetry in contingency tables. Journal of the American Statistical Association 43, 572-574
[12] Breslow N.E., Day N.E. (1980), Statistical Methods in Cancer Research: Vol. I - The Analysis of Case-Control Studies. Lyon: International Agency for Research on Cancer
[13] Breslow N.E. (1996), Statistics in epidemiology: the case-control study. Journal of the American Statistical Association 91, 14-28
[14] Breslow N.E. (1974), Covariance analysis of censored survival data. Biometrics 30(1), 89-99
[15] Brown L.D., Cai T.T., DasGupta A. (2001), Interval Estimation for a Binomial Proportion. Statistical Science Vol. 16, no. 2, 101-133
127. AMPLE 18.1 (file iris.pqs) That classical set of data was first published in Ronald Aylmer Fisher's 1936 [29] work in which discriminant analysis was presented. The file contains the measurements (in centimeters) of the length and width of the petals and sepals for 3 species of irises. The studied species are setosa, versicolor, and virginica. It is interesting how the species can be distinguished on the basis of the obtained measurements. [Photos: setosa, versicolor, virginica. The photos are from the scientific paper: Lee et al. (2006), Application of a noisy data classification technique to determine the occurrence of flashover in compartment fires.] Principal component analysis will allow us to point to those measurements (the length and the width of the petals and sepals) which give the researcher the most information about the observed flowers. The first stage of work, done even before defining and analyzing principal components, is checking the advisability of conducting the analysis. We start, then, from defining a correlation matrix of the variables and analyzing the obtained correlations with the use of Bartlett's test and the KMO coefficient.
Correlation matrix:
               Sepal Length   Sepal Width   Petal Length   Petal Width
Sepal Length   1              -0.11757      0.871754       0.817941
Sepal Width    -0.11757       1             -0.42844       -0.366171
Petal Length   0.871754       -0.42844      1              0.962865
Petal Width    0.817941       -0.366171     0.962865       1
Analysed variables: Sepal Length, Sepal Width, ... Significance level: 0.05. Analysis of correlation matrix. Bartlett test: Chi-square sta
128. Bowker test of internal symmetry can be calculated on the basis of either raw data or a c x c contingency table. Table 11.6: c x c contingency table for the observed frequencies of dependent variables [table of observed frequencies $O_{ij}$, $i, j = 1, 2, \ldots, c$]. Hypotheses: $H_0: O_{ij} = O_{ji}$, $H_1: O_{ij} \ne O_{ji}$ for at least one pair $O_{ij}$, $O_{ji}$, where $j \ne i$, $j = 1, 2, \ldots, c$, $i = 1, 2, \ldots, c$, so $O_{ij}$ and $O_{ji}$ are the frequencies of the symmetrical pairs in the c x c table. 11 COMPARISON - 2 GROUPS The test statistic is defined by $$\chi^2 = \sum_{i=1}^{c} \sum_{j>i} \frac{(O_{ij} - O_{ji})^2}{O_{ij} + O_{ji}}.$$ This statistic asymptotically (for a large sample size) has the $\chi^2$ distribution with the number of degrees of freedom calculated using the formula $df = \frac{c(c-1)}{2}$. The p value, designated on the basis of the test statistic, is compared with the significance level $\alpha$: if $p \le \alpha \Longrightarrow$ reject $H_0$ and accept $H_1$; if $p > \alpha \Longrightarrow$ there is no reason to reject $H_0$. EXAMPLE 11.9 (opinion.pqs file) Two different surveys were carried out. They were supposed to analyse students' opinions about a particular academic professor. Both surveys enabled students to give a positive opinion, a negative one, or a neutral one. Both surveys were carried out on the basis of the same sample of 250 students, but the first one was carried out the day before an exam done by the professor and the other survey the day after the
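The Bowker statistic described above sums, over each symmetric pair of cells, the squared difference of the two frequencies divided by their sum. The sketch below illustrates it on a made-up 3 x 3 table; the table values and the function name `bowker_statistic` are hypothetical and do not come from the opinion example.

```python
def bowker_statistic(table):
    """Return (chi-square statistic, degrees of freedom) for a square c x c table."""
    c = len(table)
    if any(len(row) != c for row in table):
        raise ValueError("the contingency table must be square")
    statistic = 0.0
    for i in range(c):
        for j in range(i + 1, c):
            total = table[i][j] + table[j][i]
            # Pairs with no observations carry no information; skipping them
            # (an assumption of this sketch) also avoids division by zero.
            if total > 0:
                statistic += (table[i][j] - table[j][i]) ** 2 / total
    return statistic, c * (c - 1) // 2


# Hypothetical paired answers: rows = first survey, columns = second survey.
hypothetical_table = [[10, 5, 3],
                      [2, 20, 6],
                      [1, 4, 30]]
stat, df = bowker_statistic(hypothetical_table)
print(f"chi-square = {stat:.4f}, df = {df}")
```

The diagonal cells never enter the sum, because objects that give the same answer twice say nothing about asymmetry.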
129. 17 MULTIDIMENSIONAL MODELS The predicted profit is 72 thousand dollars. Note: To be able to consider a nominal independent variable with many categories in the model, the variable ought to be decomposed into several two-category dummy variables before the analysis. Note: To take into consideration the interactions of independent variables, a variable which is the result of multiplying the variables participating in the interaction ought to be introduced into the model. 17.3 COMPARISON OF MULTIPLE LINEAR REGRESSION MODELS The window with settings for model comparison is accessed via the menu Statistics → Multidimensional models → Multiple regression - model comparison. [Screenshot: Multiple Regression - Comparing Models settings window] Multiple linear regression offers the possibility of simultaneous analysis of many independent variables. There then appears the problem of choosing the optimum mo
130. [Screenshot: ROC curve comparison settings window listing the variables WBC, PCT and bacteremia] 16 DIAGNOSTIC TESTS • Independent model: the compared ROC curves are constructed on the basis of measurements made on different objects. Hypotheses: $H_0: AUC_1 = AUC_2$, $H_1: AUC_1 \ne AUC_2$. The test statistic (Hanley J.A. and McNeil M.D. (1983) [40]) has the form $$Z = \frac{AUC_1 - AUC_2}{\sqrt{SE^2_{AUC_1} + SE^2_{AUC_2}}},$$ where $AUC_1$, $AUC_2$ and the standard errors of the areas $SE_{AUC_1}$, $SE_{AUC_2}$ are calculated on the basis of: the nonparametric DeLong method (DeLong E.R. et al. (1988) [26], Hanley J.A. and Hajian-Tilaki K.O. (1997) [38]), which is recommended; the nonparametric Hanley-McNeil method (Hanley J.A. and McNeil M.D. (1982) [39]); the method which presumes a double negative exponential distribution (Hanley J.A. and McNeil M.D. (1982) [39]), computed only when the two groups are equinumerous. The Z statistic asymptotically (for large sizes) has the normal distribution. On the basis of the test statistic, the p value is estimated and then compared with the significance level $\alpha$: if $p \le \alpha \Longrightarrow$ we reject $H_0$ and accept $H_1$; if $p > \alpha \Longrightarrow$ there is no basis for rejecting $H_0$. The window with settings for comparing independent ROC curves is acce
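The Z statistic for two independent ROC curves above needs only the two areas and their standard errors. The following sketch assumes those four numbers have already been estimated by one of the listed methods; the input values and the function name `compare_independent_auc` are hypothetical.

```python
import math
from statistics import NormalDist


def compare_independent_auc(auc1, se1, auc2, se2):
    """Return (Z, two-sided p-value) for the difference of two independent AUCs."""
    z = (auc1 - auc2) / math.sqrt(se1 ** 2 + se2 ** 2)
    # Two-sided p-value from the standard normal distribution.
    p = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return z, p


# Hypothetical areas and standard errors for two diagnostic variables.
z_value, p_value = compare_independent_auc(0.85, 0.04, 0.75, 0.03)
print(f"Z = {z_value:.3f}, p = {p_value:.4f}")
```

The standard errors are combined by simple addition of variances only because the curves come from different objects; curves built on the same objects require the dependent model.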
131. E 11.4 (sex-education.pqs file) There is a sample which consists of 34 persons (n = 34). You need to analyse 2 features of these persons (X = sex, Y = education). Sex occurs in 2 categories ($X_1$ = woman, $X_2$ = man), education occurs in 3 categories ($Y_1$ = primary vocational, $Y_2$ = secondary, $Y_3$ = higher). In the case of raw data, when you open the window with the options for the test, for example the $\chi^2$ test for the R x C table, the option raw data will be selected automatically. [Screenshot: Chi-square test RxC settings window with Variable 1 (sex), Variable 2 (education) and the report options Add analysed data, Add graph, Add percentages] In the case of data gathered in a contingency table, it is worth selecting this data (the values only, without headings) before you open the above-mentioned window. If you do so, then on opening the window the contingency table option will be selected automatically and all the data from the selection will be shown. 11 COMPARISON - 2 GROUPS [Screenshot: Chi-square test RxC settings window with the contingency table option] In the test window you can
132. ELS The Odds Ratio for variable $X_1$ is then expressed with the formula: $$OR_1(2)/(1) = \frac{Odds(2)}{Odds(1)} = \frac{e^{\beta_0 + \beta_1 X_1(2) + \beta_2 X_2 + \ldots + \beta_k X_k}}{e^{\beta_0 + \beta_1 X_1(1) + \beta_2 X_2 + \ldots + \beta_k X_k}} = e^{\beta_1 X_1(2) - \beta_1 X_1(1)} = e^{\beta_1 [X_1(2) - X_1(1)]}.$$ Example: If the independent variable is age expressed in years, then the difference between neighboring age categories, such as 25 and 26 years, is 1 year ($X_1(2) - X_1(1) = 26 - 25 = 1$). In such a case we will obtain the individual Odds Ratio $OR = e^{\beta_1}$, which expresses the degree of change of the odds for the occurrence of the distinguished value if the age is changed by 1 year. The odds ratio calculated for non-neighboring variable categories, such as 25 and 30 years, will be a five-year Odds Ratio, because the difference $X_1(2) - X_1(1) = 30 - 25 = 5$. In such a case we will obtain the five-year Odds Ratio $OR = e^{5\beta_1}$, which expresses the degree of change of the odds for the occurrence of the distinguished value if the age is changed by 5 years. Note: If the analysis is made for a non-linear model or if interaction is taken into account, then, on the basis of a general formula, we can calculate an appropriate Odds Ratio by changing the formula which expresses Z. 17.4.2 Model verification. Statistical significance of particular variables in the model (significance of the Odds Ratio). On the basis of the coefficient and its error of estimation we can i
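The age example above can be checked numerically: the odds ratio depends only on the coefficient and on the difference between the two compared values. In this sketch the coefficient 0.1 is a made-up value and `odds_ratio` is a hypothetical helper, not a PQStat function.

```python
import math


def odds_ratio(beta: float, x_from: float, x_to: float) -> float:
    """Odds ratio for moving one variable from x_from to x_to, others held fixed."""
    return math.exp(beta * (x_to - x_from))


beta_age = 0.1  # hypothetical logistic regression coefficient for age (per year)

one_year_or = odds_ratio(beta_age, 25, 26)   # neighboring categories
five_year_or = odds_ratio(beta_age, 25, 30)  # categories five years apart

print(f"1-year OR = {one_year_or:.4f}, 5-year OR = {five_year_or:.4f}")
```

The five-year odds ratio is the one-year odds ratio raised to the fifth power, which is why it should not be read as five times the one-year value.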
133. ESES TESTING you to choose a more probable hypothesis (null or alternative). But you always need to assume that the null hypothesis is the right one, and all the evidence gathered as data is supposed to supply you with a sufficient number of counterarguments to the hypothesis: if $p \le \alpha \Longrightarrow$ reject $H_0$ and accept $H_1$; if $p > \alpha \Longrightarrow$ there is no reason to reject $H_0$. The significance level usually chosen is $\alpha = 0.05$, accepting that in 5% of the situations we will reject a null hypothesis when it is the right one. In specific cases you can choose another significance level, for example 0.01 or 0.001. Note: a statistical test may not be compatible with reality in two cases:
                         reality: H0 true   reality: H0 false
test result: H0 true     OK                 β
test result: H0 false    α                  OK
We may make two kinds of mistakes: $\alpha$ = 1st type of error (probability of rejecting hypothesis $H_0$ when it is the right one), $\beta$ = 2nd type of error (probability of accepting hypothesis $H_0$ when it is the wrong one). The power of the test is $1 - \beta$. The values $\alpha$ and $\beta$ are connected with each other. The approved practice is to assume the significance level $\alpha$ in advance and to minimise $\beta$ by increasing the sample size. The 3rd step: Description of results of hypotheses verification. 10 COMPARISON - 1 GROUP [Diagram: test selection flowchart by measurement scale (interval, ordinal) and normality of the data, leading for example to the Wilcoxon signed-ranks test]
134. [Report: frequency table for the variable number of contracts (data filter: kind of contract = individual) with the columns Frequency, Cumulative frequency, Percent, Cumulative percent] EXAMPLE 6.2 (fertiliser.pqs file) An experiment was carried out in order to analyse the microbiological condition of soil in which perennial ryegrass, fertilised with biologically active fertilisers, is grown. The soil was fertilised with various microbiological specimens and fertilisers. After that, the number of microorganisms occurring in 1 gram of dry mass of soil was calculated. You want to get to know the frequency of the occurrence of actinomycetes in 1 gram of dry mass of the soil fertilised with nitrogen. You want to find out how often, in the analysed sample, values of actinomycetes occurred in the intervals from 0 to 20, from over 20 to 40, and from over 40 to 60. You need to select only the first 54 rows in the datasheet, which fulfil the analysis assumptions (actinomycetes, fertilised with nitrogen), and then open the frequency tables window via the Statistics menu → Frequency tables. In the options window you need to select the variable which you want to analyse: the number of microorganisms. After that you need to set ranges (classes) so that the start value is 0 and the step value is 20. At the top of the window you should see the message. Now confirm your choice by clicking OK and you will get a result presented in th
135. IVE STATISTICS Example: the mass of an object [kg], the area of an object [m²], time [years], speed [km/h], etc. Ordinal scale. Variables are assessed on an ordinal scale if: it is possible to order them (so the sequence of the elements does matter); it is impossible to define the quotient and the difference between two values in a logical way. Example: education, competitors' order on the podium, etc. Note: if a variable is assessed on an ordinal scale, then to enable proper calculations on it, it should be written by means of numbers. Numbers are a kind of agreed identifiers telling us about the order of elements. Nominal scale. Variables are assessed on a nominal scale if: it is impossible to order them (because there is no order resulting from the nature of the given occurrence); it is impossible to define the quotient and the difference between two values in a logical way. Example: sex, country of residence, etc. Note: if a variable is assessed on a nominal scale, it can be written by means of text labels. Even if the values of a nominal variable are written in numbers, these numbers are only a kind of agreed identifiers, so it is impossible to make any arithmetical calculations based on them and it is also impossible to compare them. 7 DESCRIPTIVE STATISTICS 7.2 MEASURES OF POSITION (LOCATION) 7.2.1 CENTRAL TENDENCY MEASURES Ce
136. If a study is a cohort study, the relative risk of the occurrence of the phenomenon is calculated for the table. Usually these are prospective studies: the researcher cares about the experiment conditions, because the structure of the analysed phenomenon in the sample and in the population should be similar. The odds ratio (2 x 2 table). To designate the odds ratio we calculate the odds of being a case in the exposed group and in the unexposed group according to the formulas: $$odds_{exposed} = \frac{O_{11}/(O_{11}+O_{12})}{O_{12}/(O_{11}+O_{12})} = \frac{O_{11}}{O_{12}}, \qquad odds_{unexposed} = \frac{O_{21}/(O_{21}+O_{22})}{O_{22}/(O_{21}+O_{22})} = \frac{O_{21}}{O_{22}}.$$ The Odds Ratio: $$OR = \frac{O_{11}/O_{12}}{O_{21}/O_{22}} = \frac{O_{11}O_{22}}{O_{12}O_{21}}.$$ The test of significance for the OR. This test is used to verify the hypothesis that the odds of the occurrence of the analysed phenomenon are the same in the group exposed and in the group unexposed to the risk factor. Hypotheses: $H_0: OR = 1$, $H_1: OR \ne 1$. The test statistic is defined by $$z = \frac{\ln(OR)}{SE},$$ 11 COMPARISON - 2 GROUPS where $SE = \sqrt{\frac{1}{O_{11}} + \frac{1}{O_{12}} + \frac{1}{O_{21}} + \frac{1}{O_{22}}}$ is the standard error of $\ln(OR)$. The test statistic asymptotically (for a large sample size) has the normal distribution. The p value, designated on the basis of the test statistic, is compared with the significance level $\alpha$: if $p \le \alpha \Longrightarrow$ reject $H_0$ and accept $H_1$; if $p > \alpha \Longrightarrow$ there is no reason to reject $H_0$. Note: In the interpretation of odds ra
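The odds ratio and its significance test above can be worked through on a small table. The counts below are invented for illustration, and `odds_ratio_test` is a hypothetical helper rather than PQStat's routine; it assumes all four cells are positive.

```python
import math
from statistics import NormalDist


def odds_ratio_test(o11, o12, o21, o22):
    """Return (OR, z, two-sided p-value) for a 2 x 2 table with positive cells."""
    odds_ratio = (o11 * o22) / (o12 * o21)
    # Standard error of ln(OR).
    se = math.sqrt(1 / o11 + 1 / o12 + 1 / o21 + 1 / o22)
    z = math.log(odds_ratio) / se
    p = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return odds_ratio, z, p


# Hypothetical table: rows = exposed / unexposed, columns = case / non-case.
or_value, z_value, p_value = odds_ratio_test(20, 10, 10, 20)
print(f"OR = {or_value:.2f}, z = {z_value:.3f}, p = {p_value:.4f}")
```

The test works on the logarithm of the odds ratio because ln(OR) is approximately normal for large samples, whereas OR itself is skewed.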
137. It does not inform you which populations are different from each other. To gain more detailed knowledge about the differences in particular parts of our complex structure, you should use contrasts (if you make the earlier planned and usually only particular comparisons) or the procedures of multiple comparisons, POST-HOC tests (when, having done the analysis of variance, we look for differences, usually between all the pairs). 12.1.2 The contrasts and the POST-HOC tests. The number of all the possible simple comparisons is calculated using the following formula: $\binom{k}{2} = \frac{k(k-1)}{2}$. Hypotheses. The first example, simple comparisons (comparison of 2 selected means): $H_0: \mu_1 = \mu_2$, $H_1: \mu_1 \ne \mu_2$. The second example, complex comparisons (comparison of a combination of selected means): $H_0: \mu_1 = \frac{\mu_2 + \mu_3}{2}$, $H_1: \mu_1 \ne \frac{\mu_2 + \mu_3}{2}$. If you want to define the selected hypothesis, you should ascribe a contrast value $c_j$, $j = 1, 2, \ldots, k$, to each mean. The $c_j$ values are selected so that their sums on the compared sides are opposite numbers, and their values for means which are not analysed are 0. The first example: $c_1 = 1$, $c_2 = -1$, $c_3 = 0, \ldots, c_k = 0$. The second example: $c_1 = 2$, $c_2 = -1$, $c_3 = -1$, $c_4 = 0, \ldots, c_k = 0$. How to choose the proper hypothesis: (i) Comparing the differences between the selected means with the critical difference (CD) calculated using the proper POST-HOC test: if the differences between means $\ge CD \Longrightarrow$ reject $H_0$ and accept $H_1$
138. 8 PROBABILITY DISTRIBUTIONS 8.2 PROBABILITY DISTRIBUTION CALCULATOR The area under a curve (density function) is p, the probability of occurrence of all possible values of an analysed random variable. The whole area under a curve comes to p = 1. If you want to analyse just a part of this area, you must give the border value, which is called the critical value or Statistic. To do this, you need to open the Probability distribution calculator window. In this window you can calculate not only the value of the area under the curve (p value) of the given distribution on the basis of the Statistic, but also the Statistic value on the basis of the p value. To open the window of the Probability distribution calculator, you need to select Probability distribution calculator from the Statistics menu. [Screenshot: Probability distribution calculator window with the fields Statistic, p value, mean and standard deviation] EXAMPLE 8.1 Probability distribution calculator. A mobile network operator did research which was supposed to show the usage of free minutes given to its clients on a pay-monthly contract. On the basis of a sample which consists of 200 of the above-mentioned network clients (where the distribution of used free minutes has the shape of the normal distribution), the mean value $\bar{x} = 161.15$ min and the standard deviation $sd = 13.03$ min were calculated. We want to calculate the probability that the chosen client used
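Both directions of the calculator described above (statistic to p value and p value back to statistic) can be sketched with Python's standard library. The mean and standard deviation come from the example; the threshold of 150 minutes is an assumption made for illustration, since the question in the example is cut off.

```python
from statistics import NormalDist

# Normal distribution of used free minutes from the example.
minutes = NormalDist(mu=161.15, sigma=13.03)

threshold = 150.0  # assumed border value (the "Statistic"), for illustration only

# Area under the density curve to the left of the threshold (p value).
p_left = minutes.cdf(threshold)
# The opposite tail, matching a "1 - p value" option.
p_right = 1.0 - p_left
# Inverse calculation: recover the Statistic from the p value.
recovered = minutes.inv_cdf(p_left)

print(f"P(X <= {threshold}) = {p_left:.4f}, P(X > {threshold}) = {p_right:.4f}")
print(f"Statistic for p = {p_left:.4f}: {recovered:.2f}")
```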
139. OMPARISON - 1 GROUP where $\theta$ is the median of the analysed feature of the population represented by the sample and $\theta_0$ is a given value. Now you should calculate the value of the test statistic Z (T for a small sample size), and based on this the p value. The p value, designated on the basis of the test statistic, is compared with the significance level $\alpha$: if $p \le \alpha \Longrightarrow$ reject $H_0$ and accept $H_1$; if $p > \alpha \Longrightarrow$ there is no reason to reject $H_0$. Note: Depending on the size of the sample, the test statistic takes a different form: for a small sample size $$T = \min\left(\sum R_-, \sum R_+\right),$$ where $\sum R_+$ and $\sum R_-$ are respectively the sums of positive and negative ranks. This statistic has the Wilcoxon distribution; for a large sample size $$Z = \frac{T - \frac{n(n+1)}{4}}{\sqrt{\frac{n(n+1)(2n+1)}{24} - \frac{\sum t^3 - \sum t}{48}}},$$ where n is the number of ranked signs (the number of ranks) and t is the number of cases included in a tied rank. The test statistic formula Z includes the correction for ties. This correction should be used when ties occur (when there are no ties, the correction is not calculated, because $\frac{\sum t^3 - \sum t}{48} = 0$). The Z statistic asymptotically (for a large sample size) has the normal distribution. Continuity correction of the Wilcoxon test (Marascuilo and McSweeney (1977) [60]). A continuity correction is used to enable the test statistic to take in all values of real numbers, according to the assumption of the normal distribution. Test statistic with a continu
140. PQStat Software. Statistical Computational Software. User Guide PQStat. Barbara Wieckowska. COPYRIGHT 2010-2014 PQSTAT SOFTWARE. All rights reserved. Version 1.4.8 P7909121213 www.pqstat.pl
Contents
1 SYSTEM REQUIREMENTS
2 HOW TO INSTALL
3 WORKING WITH DOCUMENTS
3.1 HOW TO WORK WITH DATASHEETS
3.1.1 HOW TO ADD, TO DELETE AND TO EXPORT DATASHEETS
3.1.2 HOW TO INSERT DATA INTO A SHEET
3.1.3 DATASHEET WINDOW
3.1.4 CELLS FORMAT
3.1.5 DATA EDITING
3.1.6 HOW TO SORT DATA
3.1.7 HOW TO CONVERT RAW DATA INTO CONTINGENCY TABLE
3.1.8 HOW TO CONVERT CONTINGENCY TABLE INTO RAW DATA
3.1.9 FORMULAS
3.1.10 HOW TO GENERATE DATA
3.1.11 MISSING DATA
3.1.12 NORMALIZATION/STANDARDIZATION
3.1.13 SIMILARITY MATRIX
3.2 HOW TO WORK WITH REPORTS (RESULTS SHEETS)
3.3 HOW TO CHANGE LANGUAGE SETTINGS IN PQSTAT
3.4 MENU
4 HOW TO ORGANISE WORK WITH PQSTAT
141. RIC TESTS
10.1.1 The t-test for a single sample
10.2 NONPARAMETRIC TESTS
10.2.1 The Kolmogorov-Smirnov test and the Lilliefors test
10.2.2 The Wilcoxon test (signed-ranks)
10.2.3 The Chi-square goodness-of-fit test
10.2.4 Tests for proportion
11 COMPARISON - 2 GROUPS
11.1 PARAMETRIC TESTS
11.1.1 The Fisher-Snedecor test
11.1.2 The t-test for independent groups
11.1.3 The t-test with the Cochran-Cox adjustment
11.1.4 The t-test for dependent groups
11.2 NONPARAMETRIC TESTS
11.2.1 The Mann-Whitney U test
11.2.2 The Wilcoxon test (matched-pairs)
11.2.3 TESTS FOR CONTINGENCY TABLES
11.2.4 The Chi-square test for trend for Rx2 tables
11.2.5 The Chi-square test and Fisher test for RxC tables
11.2.6 The Chi-square test and the Fisher test for 2x2 tables (with corrections)
11.2.7 Relative Risk and Odds Ratio
11.2.8 The Z test for 2 independent proportions
11.2.9 The McNemar test, the Bowk
142. S FOR THE ANALYSIS IN MULTIDIMENSIONAL MODELS
17.1.1 Variable coding in multidimensional models
17.1.2 Interactions
17.2 MULTIPLE LINEAR REGRESSION
17.2.1 Model verification
17.2.2 More information about the variables in the model
17.2.3 Analysis of model residuals
17.2.4 Prediction on the basis of the model
17.3 COMPARISON OF MULTIPLE LINEAR REGRESSION MODELS
17.4 LOGISTIC REGRESSION
17.4.1 Odds Ratio
17.4.2 Model verification
17.5 COMPARISON OF LOGISTIC REGRESSION MODELS
18 DIMENSION REDUCTION AND GROUPING
18.1 PRINCIPAL COMPONENT ANALYSIS
18.1.1 The interpretation of coefficients related to the analysis
18.1.2 Graphical interpretation
18.1.3 The criteria of dimension reduction
18.1.4 Defining principal components
18.1.5 The advisability of using the Principal component analysis
19 SURVIVAL ANALYSIS
19.1 LIFE TABLES
19.2 KAPLAN-MEIER CURVES
143. The test statistic has the form presented below: $$Z = \frac{p_1 - p_2}{\sqrt{O_{21} + O_{12}}} \cdot n.$$ The Z statistic asymptotically (for a large sample size) has the normal distribution. On the basis of the test statistic, the p value is estimated and then compared with the significance level $\alpha$: if $p \le \alpha \Longrightarrow$ reject $H_0$ and accept $H_1$; if $p > \alpha \Longrightarrow$ there is no reason to reject $H_0$. Note: The confidence interval for the difference of two dependent proportions is estimated on the basis of the Newcombe-Wilson method. The window with settings for the Z-Test for two dependent proportions is accessed via the menu Statistics → Nonparametric tests (nonordered categories) → Z-Test for two dependent proportions. [Screenshot: Z-test for two dependent proportions settings window] EXAMPLE 11.9 cont. (file opinia.pqs) When we limit the study to people who have a specific opinion about the professor (i.e. those who only have a positive or a negative opinion), we will have 152 such students. The data for calculations are: $O_{11} = 50$, $O_{12} = 4$, $O_{21} = 44$, $O_{22} = 54$. 11 COMPARISON - 2 GROUPS We know that 35.53% students expresse
144. This measure tells us how much the spread of data around the mean is similar to the spread of data in the normal distribution. The greater than zero the value of kurtosis is, the narrower the tested distribution is than the normal one. Conversely, the lower than zero the value of kurtosis is, the flatter the tested distribution is than the normal one. Kurtosis is defined by: $$K = \frac{n(n+1)}{(n-1)(n-2)(n-3)} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{sd}\right)^4 - \frac{3(n-1)^2}{(n-2)(n-3)},$$ where $x_i$ are the following values of a variable, $\bar{x}$ and $sd$ are respectively the arithmetic mean and standard deviation of $x_i$, and n is the sample size (frequency). EXAMPLE 7.1 (fertilisers.pqs file) In an experiment related to fertilising soil with various sorts of microbiological specimens and fertilisers, it was calculated how many microorganisms occur in 1 gram of dry mass of soil. Now we would like to calculate descriptive statistics of the amount of actinomycetes for the sample fertilised with nitrogen. Additionally, we want the data to be illustrated in a Box-Whiskers plot. In the datasheet we select only the first 54 rows, which are relevant to the assumptions of the analysis (actinomycetes, fertilised with nitrogen). Then we open the Descriptive statistics window via the Statistics menu → Descriptive statistics. In the window of descriptive statistics options select a variable to analyse (the number of microorganisms) and then all the procedures you want to follow, for example arithmetic mean altogether wi
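The kurtosis definition above, with its small-sample correction terms, translates directly into code. The sketch below uses the sample standard deviation (divisor n − 1), which this estimator assumes; the data set is a toy example and `kurtosis` is a hypothetical helper name.

```python
from statistics import mean, stdev


def kurtosis(values):
    """Sample excess kurtosis with the small-sample correction (needs n >= 4)."""
    n = len(values)
    if n < 4:
        raise ValueError("at least 4 observations are required")
    x_bar = mean(values)
    sd = stdev(values)  # sample standard deviation, divisor n - 1
    fourth_powers = sum(((x - x_bar) / sd) ** 4 for x in values)
    scale = n * (n + 1) / ((n - 1) * (n - 2) * (n - 3))
    correction = 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))
    return scale * fourth_powers - correction


toy_sample = [1, 2, 3, 4, 5]
k = kurtosis(toy_sample)
print(round(k, 4))
```

Evenly spread values like these give a negative result, in line with the statement that values below zero indicate a distribution flatter than the normal one.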
145. $\alpha \Longrightarrow$ there is no reason to reject $H_0$. The settings window with the ICC - Intraclass Correlation Coefficient can be opened via the Statistics menu → Parametric tests → ICC - Intraclass Correlation Coefficient, or in the Wizard. [Screenshot: ICC - Intraclass Correlation Coefficient settings window with the variables sound level meter I, II, III] EXAMPLE 15.1 (sound intensity.pqs file) The concordance of sound intensity was measured by three different meters. The measurements were done in 12 different measuring points. [Table: readings of the three meters at the measuring points A to L] Hypotheses: $H_0$: a lack of an absolute concordance between the levels of sound intensity measured by three different meters in the population represented by the sample, $H_1$: the levels of sound intensity measured in the population represented by the sample are absolutely concordant. [Report: Analysis time, Analysed variables, Significance level, Total sum of squares (SS T), Between conditions sum of squares (SS BC), Between subjects sum of squares (SS BS), Residual sum of squares (SS RES), Between c
146. a difference means that the variables present in the full model but absent in the reduced model do not carry significant information. However, if the difference is statistically significant, it means that one of the models (the one with the greater number of variables) is significantly better than the other one.

In PQStat the comparison of models can be done manually or automatically.

• Manual model comparison: construction of 2 models:
  a full model: a model with a greater number of variables,
  a reduced model: a model with a smaller number of variables; such a model is created from the full model by removing those variables which are superfluous from the perspective of studying a given phenomenon.
  The choice of independent variables in the compared models and, subsequently, the choice of the better model on the basis of the results of the comparison is made by the researcher.

• Automatic model comparison is done in several steps:
  step 1 Constructing the model with the use of all variables.
  step 2 Removing one variable from the model. The removed variable is the one which, from the statistical point of view, contributes the least information to the current model.
  step 3 A comparison of the full and the reduced model.
  step 4 Removing another variable from the model. The removed variable is the one which, from the statistical point of view, contributes the least information to the current model.
  step 5 A comparison o
147. $x_{ij}$: the value of the $j$-th measurement for the $i$-th object (so 0 or 1).

This statistic asymptotically (for a large sample size) has the $\chi^2$ distribution with the number of degrees of freedom calculated using the formula $df = k - 1$. The p value designated on the basis of the test statistic is compared with the significance level $\alpha$:
if $p \le \alpha$ ⟹ reject $H_0$ and accept $H_1$,
if $p > \alpha$ ⟹ there is no reason to reject $H_0$.

The POST-HOC tests
An introduction to the contrasts and the POST-HOC tests is given in unit 12.1.2, which relates to the one-way analysis of variance.

The Dunn test
For simple comparisons (the frequency in particular measurements is always the same).

Hypotheses (example: simple comparisons for the difference in proportion in one chosen pair of measurements):
$H_0$: the chosen "incompatible" observed frequencies are equal,
$H_1$: the chosen "incompatible" observed frequencies are different.

(i) The value of the critical difference is calculated by using the following formula:

$$CD = Z_{\alpha/c} \sqrt{2\,\frac{kT - R}{n^2 k (k-1)}}$$

where $Z_{\alpha/c}$ is the critical value (statistic) of the normal distribution for a given significance level $\alpha$, corrected for the number of possible simple comparisons $c$.

(ii) The test statistic is defined by:

$$Z = \frac{\sum_{j=1}^{k} c_j p_j}{\sqrt{2\,\frac{kT - R}{n^2 k (k-1)}}}$$

where $p_j$: the proportio
148. [Report: comparison of the areas under the ROC curves for WBC and PCT (DeLong's method). Analysed variables: WBC, PCT, bacteremia. Significance level: 0.05.]

                                    WBC            PCT
Direction of diagnostic variable    stimulant      stimulant
AUC                                 0.86130799     0.895618557
SE(AUC)                             0.051727687    0.049011126
-95% CI                             0.759923577    0.79955852
+95% CI                             0.96269238     0.991678596

DeLong's method
AUC1-AUC2                           0.034310567
SE(AUC1-AUC2)                       0.022679669
-95% CI                             …
+95% CI                             0.078761905
Z statistic                         1.512833666
p value                             0.1303271915

[Figure: ROC curves for WBC and PCT; vertical axis: sensitivity, from 0 to 1.]

The calculated areas are $AUC_{WBC} = 0.8613$ and $AUC_{PCT} = 0.8956$. At the adopted significance level $\alpha = 0.05$, based on the obtained value $p = 0.13032$, we conclude that we cannot determine which of the parameters, WBC or PCT, is better for diagnosing bacteremia.

ad 2) The PCT parameter is a stimulant (its value is high in bacteremia). In the course of the comparison of its diagnostic value for girls and boys we verify the following hypotheses:
$H_0$: the area under the ROC curve for $PCT_f$ = the area under the ROC curve for $PCT_m$,
$H_1$: the area under the ROC curve for $PCT_f$ ≠ the area under the ROC curve for $PCT_m$.

[Report fields: Analysis time, Analysed variables, Count of missing data, Significance level, Grouping variable, Direction of diagnostic variable, Method, Group name, Size, Size (STATE = yes), Size (STATE = no), AUC, SE(AUC), -95% CI, +95% CI, Group name, Size, Size STATE
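The Z statistic and p value in the report above can be reproduced from the reported difference of the areas and its standard error. This is a minimal Python sketch, assuming a two-sided test based on the standard normal distribution; the function name is illustrative, and the standard error itself (DeLong's method) is taken as given rather than recomputed.

```python
import math


def z_test_for_auc_difference(auc_difference, se_difference):
    """Return (Z, two-sided p value) for a difference of two AUCs."""
    z = auc_difference / se_difference
    # Two-sided p value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Values copied from the report: AUC1-AUC2 and SE(AUC1-AUC2).
z, p = z_test_for_auc_difference(0.034310567, 0.022679669)
print(round(z, 4), round(p, 4))
```

With the reported inputs the sketch gives Z of about 1.51 and p of about 0.13, in line with the report.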
149. able of true positive (TP), true negative (TN), false positive (FP) and false negative (FN) values. Most frequently, though, diagnostic tests are based on continuous variables or ordered categorical variables. In such a situation the proper means of evaluating the capability of the test for differentiating (+) and (−) are ROC (Receiver Operating Characteristic) curves.

It is frequently observed that the greater the value of the diagnostic variable, the greater the odds of occurrence of the studied phenomenon, or the other way round: the smaller the value of the diagnostic variable, the greater the odds of occurrence of the studied phenomenon. Then, with the use of ROC curves, the choice of the optimum cut-off is made, i.e. the choice of a certain value of the diagnostic variable which best separates the studied statistical population into two groups: (+) in which the given phenomenon occurs and (−) in which the given phenomenon does not occur.

When two or more ROC curves are constructed on the basis of studies of the same objects, one can compare the curves with regard to the quality of classification.

Let us assume that we have at our disposal a sample of $n$ elements, in which each object has one of the $k$ values of the diagnostic variable. Each of the received values of the diagnostic variable $x_1, x_2, \ldots, x_k$ becomes the cut-off $x_{cat}$.

If the diagnostic variable is:
• stimulant (the growth of its value makes the odds of occurrence of the s
150. abs(v1): returns the absolute value of the given number
odd(v1): returns 1 if the given number is even or 0 if the given number is odd
sum(v1): returns the result of an addition of the given numbers
multip(v1): returns the result of a multiplication of the given numbers
power(v1, n): returns the value of the given number raised to the n-th power
norme(v1): returns the Euclidean vector norm
round(v1, n): returns a number rounded to n decimal places

Statistical functions
Statistical functions require numeric arguments.
stand(v1): returns a standardised score of the given numbers
max(v1): returns the highest value out of the given numbers
min(v1): returns the lowest value out of the given numbers
mean(v1): returns the arithmetic mean of the given numbers
meanh(v1): returns the harmonic mean of the given numbers
meang(v1): returns the geometric mean of the given numbers
median(v1): returns the median of the given numbers
q1(v1): returns the lower quartile of the given numbers
q3(v1): returns the upper quartile of the given numbers
cv(v1): returns the coefficient of variation of the given numbers
range(v1): returns the range of the given numbers
iqrange(v1): returns the interquartile range of the given numbers
v
151. al distribution, in other words the observed feature distribution in a sample. To define an empirical feature distribution you need to assign the frequency of occurrence to the consecutive values of this feature. Such a distribution may be presented either in a frequency table or in a graph (histogram). For small data sets the frequency table can show all the data (a so-called frequency distribution). For larger data sets it is called a grouped frequency distribution.

To present the data distribution in a table, open the Frequency tables window by selecting Statistics menu → Frequency tables.

[Figure: Frequency tables settings window, with the variable list (kind of contract, number of contracts), the "interpret as a text" option, the interval classes settings (start value 130, step value 5), the "sort results" option, the data filter and the report options "Add analysed data" and "Add graph".]

In this window you should select the variable that you want to analyse and the analysis options. If the options are chosen properly, we can sort the calculated result, treating the variable values as text or as numbers. If there are empty cells in an analysed column, they can be included in or omitted from the analysis. The result of a particular analysis will appear in a report added to the datasheet for which the analysis h
152. al scale,
• an independent model.

Hypotheses:
$H_0: \theta_1 = \theta_2 = \ldots = \theta_k$,
$H_1$: not all $\theta_j$ are equal $(j = 1, 2, \ldots, k)$,
where $\theta_1, \theta_2, \ldots, \theta_k$ are the medians of the analysed variable in each population.

The test statistic is defined by:

$$H = \frac{1}{C}\left(\frac{12}{N(N+1)} \sum_{j=1}^{k} \frac{\left(\sum_{i=1}^{n_j} R_{ij}\right)^2}{n_j} - 3(N+1)\right)$$

where:
$N = \sum_{j=1}^{k} n_j$,
$n_j$: sample sizes $(j = 1, 2, \ldots, k)$,
$R_{ij}$: ranks ascribed to the values of a variable, for $i = 1, 2, \ldots, n_j$, $j = 1, 2, \ldots, k$,
$C = 1 - \frac{\sum (t^3 - t)}{N^3 - N}$: correction for ties,
$t$: number of cases included in a tie.

The formula for the test statistic $H$ includes the correction for ties $C$. This correction is used when ties occur (if there are no ties, the correction is not calculated, because then $C = 1$).

The $H$ statistic asymptotically (for large sample sizes) has the $\chi^2$ distribution with the number of degrees of freedom calculated using the formula $df = k - 1$. The p value designated on the basis of the test statistic is compared with the significance level $\alpha$:
if $p \le \alpha$ ⟹ reject $H_0$ and accept $H_1$,
if $p > \alpha$ ⟹ there is no reason to reject $H_0$.

The POST-HOC tests
An introduction to the contrasts and the POST-HOC tests is given in unit 12.1.2, which relates to the one-way analysis of variance.

The Dunn test
For simple comparisons, equal-size groups as well as unequal-size groups.

Hypotheses (example: simple comparisons, comparison of 2 selected medians):
153. always change the default settings relating to the kind of data organisation. In this window you can also enter the data which are supposed to be put into the contingency table.

As a result you can return to the report not only the test statistic and a p value but also:

• The contingency table of observed frequencies: data in the form of a contingency table. This table shows the distribution of observations for several features (several variables). The table of 2 features (X, Y), one of which has r possible categories and the other c possible categories, is shown below (table 11.1).

Table 11.1. The contingency table of r x c observed frequencies

Observed frequencies O_ij     Feature Y                        Total
                              Y_1     Y_2     ...     Y_c
Feature X     X_1             O_11    O_12    ...     O_1c     sum_j O_1j
              X_2             O_21    O_22    ...     O_2c     sum_j O_2j
              ...             ...     ...     ...     ...      ...
              X_r             O_r1    O_r2    ...     O_rc     sum_j O_rj
Total                         sum_i O_i1  sum_i O_i2  ...  sum_i O_ic   n = sum_i sum_j O_ij

Observed frequencies $O_{ij}$ $(i = 1, 2, \ldots, r;\ j = 1, 2, \ldots, c)$ show the frequencies of occurrence of all the particular categories for both features. To return the table to the report you should choose the option Add analysed data. For the data from Example 11.4 the contingency table of the observed frequencies looks like this:

O_ij      higher    primary    secondary
female    7         …          …
male      6         8          5

• The contingency table of expected frequencies: for each contingency table of observed
154. are 100 patients of whom we know they will all die without the therapy. In the sample with therapy we also have 100 patients, of whom 50 will survive.

[Table: sample numerator and sample denominator for the patients not undergoing therapy and for the patients undergoing therapy.]

We will calculate the NNT.

[Report fields: Analysis time, Analysed variables, Significance level, Continuity correction, Difference of the proportions, -95% CI and +95% CI for the difference of the proportions, NNT, -95% CI and +95% CI for the NNT, Z statistic, p value (asymptotic) < 0.000001.]

The difference between proportions is statistically significant (p < 0.000001), but we are interested in the NNT: its value is 2, so the treatment of 2 patients for 20 years will prevent 1 death. The calculated 95% confidence interval should be rounded to whole numbers, which gives an NNT of 2 to 3 patients.

11.2.9 The McNemar test, the Bowker test of internal symmetry

Basic assumptions:
• measurement on a nominal scale,
• a dependent model.

The McNemar test
The McNemar test (McNemar (1947) [61]) is used to verify the hypothesis determining the agreement between the results of the measurements which were done twice, $X^{(1)}$ and $X^{(2)}$, of an X feature (between 2 dependent variables $X^{(1)}$ and $X^{(2)}$
155. [Figure: test settings window, with the variable list (country, group, skin colour), the data filter and the report options "Add analysed data", "Add graph" and "Add percentages/sums".]

Note
This test can be calculated only on the basis of raw data.

12.2.4 The Q-Cochran ANOVA
The Q-Cochran analysis of variance, based on the Q-Cochran test, is described by Cochran (1950) [19]. This test is an extended McNemar test for $k \ge 2$ dependent groups. It is used in hypothesis verification about symmetry between several measurements $X^{(1)}, X^{(2)}, \ldots, X^{(k)}$ of the X feature. The analysed feature can have only 2 values; for the analysis, the numbers 1 and 0 are ascribed to them.

Basic assumptions:
• measurement on a nominal scale (dichotomous variables, that is, variables of two categories),
• a dependent model.

Hypotheses:
$H_0$: all the "incompatible" observed frequencies are equal,
$H_1$: not all the "incompatible" observed frequencies are equal,
where "incompatible" observed frequencies are the observed frequencies calculated when the value of the analysed feature is different in several measurements.

The test statistic is defined by:

$$Q = \frac{(k-1)\left(kC - T^2\right)}{kT - R}$$

where:
$C = \sum_{j=1}^{k}\left(\sum_{i=1}^{n} x_{ij}\right)^2$,
$R = \sum_{i=1}^{n}\left(\sum_{j=1}^{k} x_{ij}\right)^2$,
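The Q statistic above depends only on the column totals, the row totals and the grand total of the 0/1 data. The following is a minimal Python sketch, assuming that C is the sum of squared column (measurement) totals, R the sum of squared row (object) totals and T the total number of ones; the function name `cochran_q` and the sample data are illustrative.

```python
def cochran_q(table):
    """Cochran's Q for a list of rows, each holding k measurements coded 0/1.

    Returns the statistic and its degrees of freedom (k - 1).
    """
    k = len(table[0])
    if any(len(row) != k for row in table):
        raise ValueError("every object needs the same number of measurements")
    column_totals = [sum(row[j] for row in table) for j in range(k)]
    row_totals = [sum(row) for row in table]
    total = sum(row_totals)                  # T
    c = sum(v * v for v in column_totals)    # C
    r = sum(v * v for v in row_totals)       # R
    denominator = k * total - r
    if denominator == 0:
        raise ValueError("Q is undefined when no object has mixed answers")
    return (k - 1) * (k * c - total ** 2) / denominator, k - 1


# Six objects measured three times (hypothetical data).
data = [[1, 1, 0], [1, 0, 0], [1, 1, 1], [0, 0, 0], [1, 0, 0], [1, 1, 0]]
q, df = cochran_q(data)
print(q, df)
```

Objects with identical answers in all measurements (all 0 or all 1) contribute nothing to the numerator or the denominator, which is why only the "incompatible" frequencies matter.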
156. [Report 1. Analysed variables: Contingency table]
Significance level          0.05
Phi                         0.30989
Chi-square statistic        16.325397
Degrees of freedom          1
p value                     0.000053
Q Yule                      0.578947
Z statistic                 9.211986
p value (asymptotic)        0.000001

[Report 2. Fields: Analysis time, Analysed variables, Significance level]
C Pearson                   0.296005
C Pearson max               0.707107
C Pearson adjusted          0.418611
V Cramer                    0.309869
Chi-square statistic        16.325397
Degrees of freedom          1
p value                     0.000053

The test statistic value is $\chi^2 = 16.33$ and the p value calculated for it is $p = 0.00005$. The result indicates that there is a statistically significant dependence between sex and passing the exam in the analysed population.

The coefficient values which are based on the $\chi^2$ test, and so describe the strength of the correlation between the analysed features, are:
$C_{adj}$-Pearson = 0.42,
V-Cramer = 0.31.

The Q-Yule = 0.58, and the p value of the Z test (similarly to the $\chi^2$ test) indicates a statistically significant dependence between the analysed features.

15 AGREEMENT ANALYSIS

[Diagram: choice of the agreement test according to the measurement scale.
Ordinal scale: test of significance for the Kendall's W coefficient.
Nominal scale: test of significance for the Cohen's κ coefficient.
Interval scale, when the data are normally distributed (Kolmogorov-Smirnov or Lilliefors test): test of significan
157. ariance(v1): returns the variance of the given numbers
sd(v1): returns the standard deviation of the given numbers

Text functions
Text functions work on any string of characters.
upperc(v1): converts the characters of the string to capital letters
lowerc(v1): converts the characters of the string to small letters
clean(v1): removes the unprintable characters
trim(v1): removes initial and final spaces
length(v1): returns the length of the string of characters
search(abc, v1): returns the position at which the searched string begins
concat(v1): joins texts
compare(v1): compares texts
copy(v1, i, n): returns a part of the text starting from the i-th character, where n is the number of the returned characters
count(v1): returns the number of cells which are not empty
counte(v1): returns the number of empty cells
countn(v1): returns the number of cells which contain numbers

Date and time functions
The date and time functions should be performed on data formatted as date or as time (see chapter 3.1.4). If that is not the case, the program tries to recognize the format automatically. When that is not possible, it returns the NA value.
year(v1): returns the year ascribed to the date
month(v1): returns the month ascribed to the date
day(v1): returns the day ascribed to the date
hour(v1): returns the hours ascribed
158. [Report: comparison of two Cox regression models. Analysed variables: survival time (weeks), status. Significance level: 0.05. Values of the Rx variable: 0, 1.
Values reported for the first model: 144.556519789, 146.556519789, 149.002964233, 151.360914552, 0.230949395, 0.766193449, 0.764737344.
Values reported for the second model: 172.759244379, 174.759244379, 174.902101522, 176.160441761, 0.080971661, 0.398474704, 0.397717426.
Model comparison: test statistic 28.20072459, degrees of freedom 1, p value 0.000000109.]

The analysis is complemented with the presentation of the survival curves of both groups, the treatment one and the placebo one, corrected for the influence of log WBC, for model B. In the graph we observe the differences between the groups which occur at particular points of survival time. In order to draw such curves, having selected the Add a graph option, we select the Survival function: setpoints option and set the values for the Rx variable as 0 for the first curve (the placebo group) and 1 for the second curve (the treatment group). For the log WBC variable we enter the mean value, i.e. 2.93.

[Figure: survival curves of the two groups; horizontal axis: Time, vertical axis: Survival probability.]

At the end we will evaluate the assumptions of Cox regression by analyzing the model residuals with respect to time.

[Figures: residuals plotted against Time (0 to 35): Martingale residuals, Deviance residuals, Schoenfeld residuals for log WBC.]
159. [Report: multiple linear regression, final version of the model. Dependent variable: gross_profit; significance level: 0.05. R = 0.969923, R² = 0.940751, adjusted R² = 0.931774; F statistic = 104.793926, p value < 0.000001. The report continues with the table of model coefficients, including the p value and the standardised b columns.]

[Figure: bar chart with one bar per model variable.]

The final version of the model will be used for prediction. On the basis of the predicted costs amounting to:
production cost: 11 thousand dollars,
advertising costs: 13 thousand dollars,
direct promotion costs: 0.5 thousand dollars,
the sum of discounts made: 0.5 thousand dollars,
and the fact that the author is known (the author's popularity: 1),
we calculate the predicted gross profit together with the confidence interval.

Prediction for (3) prod_c                11
Prediction for (4) advert_c              13
Prediction for (5) prom_c                0.5
Prediction for (6) rebates               0.5
Prediction for (7) popular_author        1
Prediction of Y value                    72.418189
-95% CI for point                        51.656275
+95% CI for point                        82.980103
-95% CI for expected values              68.722994
+95% CI for expected values              76.113384
160. at setup_x64 FULL exe
When you do this, a setup dialog box will appear. Press Next to continue with the installation setup.

The installation of the application requires you to accept the End User License Agreement. If you accept the terms of the license, select "I accept the terms of the license" and press Next to continue. Otherwise, select "I do not accept the terms of the licence" and press Cancel to exit the installation.

The following box enables you to change the default installation directory and to check if you have sufficient disk space. It is recommended that you accept the default installation location.

If you press Next, you can choose either a full installation of the application or a version that does not include the example data sets. The data sets are used in the User Guide.

Next, the dialog box shows the shortcut name which will be created in the Windows Start Menu and lets you change it.

After pressing Next, you can create a Desktop shortcut or add a shortcut to the Quick Launch toolbar. Press Next to continue.

The following step is the last one before the installation process starts copying files to your system. This dialog box shows a summary of the installation options chosen so far. To start the installation process, press Install.

3 WORKING WITH DOCUMENTS

Documents management
161. ata without headings before the analysis begins, because usually there is more information in a datasheet. You should also select the option indicating the content of the variable (frequency (numerator) or proportion).

The difference between the proportions distinguished in the sample is 30.56%, and the 95% confidence interval for it (15.90%, 43.35%) does not contain 0. Based on the Z test without the continuity correction as well as on the Z test with the continuity correction (p value = 0.000053 and p value = 0.0001), at the significance level $\alpha = 0.05$ the alternative hypothesis can be accepted (similarly to the Fisher exact test, its mid-p corrections, the $\chi^2$ test and the $\chi^2$ test with the Yates correction). So the proportion of men who passed the exam is different from the proportion of women who passed the exam in the analysed population. The exam was passed significantly more often by women (55.56% of all the women in the sample passed the exam) than by men (25.00% of all the men in the sample passed the exam).

EXAMPLE 11.8
Let us assume that the mortality rate of a disease is 100% without treatment and that therapy lowers the mortality rate to 50%; that is the result of 20 years of study. We want to know how many people have to be treated to prevent 1 death in 20 years. To answer that question, two samples of 100 people were taken from the population of the diseased. In the sample without treatment there
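The question posed in Example 11.8 reduces to the reciprocal of the difference between the two mortality proportions. This is a minimal Python sketch of that arithmetic, assuming the standard definition NNT = 1 / (risk without treatment − risk with treatment); the function name is illustrative, and the confidence interval reported by the program is not reproduced here.

```python
def number_needed_to_treat(deaths_untreated, n_untreated, deaths_treated, n_treated):
    """NNT as the reciprocal of the absolute risk reduction."""
    risk_untreated = deaths_untreated / n_untreated
    risk_treated = deaths_treated / n_treated
    risk_reduction = risk_untreated - risk_treated
    if risk_reduction == 0:
        raise ValueError("NNT is undefined when both risks are equal")
    return 1 / risk_reduction


# Example 11.8: 100 of 100 untreated patients die, 50 of 100 treated patients die.
print(number_needed_to_treat(100, 100, 50, 100))
```

With a mortality of 100% without treatment and 50% with treatment, the difference of proportions is 0.5, so two patients have to be treated to prevent one death.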
162. ation. The method is called "partial" because the search for the maximum of the likelihood function L (the program makes use of the Newton-Raphson iterative algorithm) only takes place for complete data; censored data are taken into account in the algorithm, but not directly.

There is a certain error of estimation for each coefficient. The magnitude of that error is estimated from the following formula:

$$SE_b = \sqrt{diag(H^{-1})_b}$$

where $diag(H^{-1})$ is the main diagonal of the covariance matrix.

Note
When building a model it ought to be remembered that the number of observations should be greater than or equal to ten times the ratio of the number of estimated model parameters ($k$) to the smaller one of the proportions of the censored or complete sizes ($p$), i.e. $n \ge 10k/p$ (Peduzzi P., et al. (1995) [67]).

Note
When building the model you need to remember that the independent variables should not be multicollinear. In a case of multicollinearity the estimation can be uncertain and the obtained error values very high. The multicollinear variables should be removed from the model, or one independent variable should be built from them, e.g. instead of the multicollinear variables of mother age and father age one can build the parents age variable.

Note
The criterion of convergence of the Newton-Raphson iterative algorithm can be controlled with the help of two parameters: the limit of convergence iteration (it gives the maximum number of iterations in which t
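The sample-size rule quoted in the first note, n ≥ 10k/p, is easy to evaluate before a model is built. The following is a minimal Python sketch; the function names and the example numbers are illustrative assumptions, not part of PQStat.

```python
import math


def minimum_observations(k, p):
    """Smallest n satisfying n >= 10 * k / p.

    k: number of estimated model parameters,
    p: the smaller of the proportions of censored and complete observations.
    """
    if not 0 < p <= 0.5:
        raise ValueError("p must be in (0, 0.5], being the smaller proportion")
    return math.ceil(10 * k / p)


def meets_rule(n, k, p):
    """True when a sample of n observations satisfies the rule."""
    return n >= minimum_observations(k, p)


# Hypothetical model: 2 parameters, 25% of the observations censored.
print(minimum_observations(2, 0.25))
print(meets_rule(42, 2, 0.25))
```

In the hypothetical case above the rule asks for at least 80 observations, so a sample of 42 would be too small for a two-parameter model.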
163. ature are changed in a different direction than for the Y feature (the number of disagreed pairs),
$T_X = \sum \left(t_{(X)}^2 - t_{(X)}\right)$, $T_Y = \sum \left(t_{(Y)}^2 - t_{(Y)}\right)$,
$t$: number of cases included in a tie.

The formula for the $\tilde{\tau}$ correlation coefficient includes the correction for ties. This correction is used when ties occur (if there are no ties, the correction is not calculated, because then $T_X = 0$ and $T_Y = 0$).

Note
$\tau$: the Kendall's correlation coefficient in a population;
$\tilde{\tau}$: the Kendall's correlation coefficient in a sample.

The value of $\tilde{\tau} \in \langle -1; 1 \rangle$, and it should be interpreted in the following way:
• $\tilde{\tau} \approx 1$ means a strong agreement of the sequence of ranks (the increasing monotonic correlation): when the independent variable increases, the dependent variable increases too;
• $\tilde{\tau} \approx -1$ means a strong disagreement of the sequence of ranks (the decreasing monotonic correlation): when the independent variable increases, the dependent variable decreases;
• if the Kendall's $\tilde{\tau}$ correlation coefficient is equal or very close to zero, there is no monotonic dependence between the analysed features (but there might exist another relation, a non-monotonic one, for example a sinusoidal relation).

The Spearman's $r_s$ versus the Kendall's $\tilde{\tau}$:
for an interval scale with a normality of the distribution, $r_s$ gives results which are close to $r_p$, but $\tilde{\tau}$ may be totally different from $r_p$;
the $\tilde{\tau}$ value is less than or equal to the $r_p$ value
164. ave been done. Additionally, if we want the data to be illustrated in a bar plot or a histogram, we select the Add graph option in the Frequency tables window.

EXAMPLE 6.1 (distribution.pqs file)
A mobile network operator carried out research to show how clients on a pay-monthly contract use the free minutes they are given. Each customer may use up to 190 free minutes every month. The research covered 200 clients. Several kinds of information were taken into account:
the kind of contract,
the amount of used free minutes,
the number of contracts taken by one client (it does not apply to companies).

Now you want to present the distribution of:
1. the kind of contract,
2. the amount of used free minutes,
3. the number of registered contracts with individual persons.

Open the Frequency tables window.

1. Choose the variable that you want to analyse (the kind of contract) and select the option to interpret it as a text value, and Add graph. Then confirm all the chosen settings by clicking OK, and you get the result presented in a report.

[Figure: frequency table and bar plot for the kind of contract (corporate, individual); vertical axis: Frequency.]

2. Do the analysis again by clicking the corresponding button. Choose the variable that you want to analyse (the amount of used free minutes) and th
165. be displayed the value of which will be smaller than the value d indicated by the user, and the value on the main diagonal;
elements above d: means that in each row of the matrix only those elements will be displayed the value of which will be greater than the value d indicated by the user, and the value on the main diagonal.

Neighborhood 0/1
By choosing the option Neighborhood 0/1 we replace the values inside the matrix with the value 1, and the empty places with the value 0. In that manner we indicate, for example, if the objects are neighbors (1) or not (0).

Standardization by rows
Standardization by rows means that each element of the matrix is divided by the sum of its row. As a result, the obtained values are in the range from 0 to 1.

Replace the empty elements
The option Replace the empty elements allows the entry of the value which is to be placed in the matrix instead of possible empty elements.

The selected identifier of the object allows us to name the rows and columns of the similarity matrix according to the nomenclature stored in the indicated variable.

EXAMPLE 3.3 (flats similarities.pqs file)
In the procedures of property pricing the issue of similarity is very important, for both substantial and legal reasons. For example, it is the main premise for grouping objects and ascribing them to an appropriate segment. Let us assume
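Standardization by rows, as described above, divides every element by the sum of its row. The following is a minimal Python sketch of that operation on a plain list-of-lists matrix, assuming non-negative entries; the function name is illustrative, and leaving an all-zero row unchanged is an assumption of this sketch, not documented PQStat behaviour.

```python
def standardize_rows(matrix):
    """Divide each element by the sum of its row."""
    result = []
    for row in matrix:
        total = sum(row)
        if total == 0:
            # Assumption: a row with zero sum is left as zeros.
            result.append([0.0 for _ in row])
        else:
            result.append([value / total for value in row])
    return result


# Hypothetical 2 x 3 matrix of non-negative weights.
print(standardize_rows([[1, 1, 2], [0, 3, 1]]))
```

After the operation every non-empty row sums to 1, which is why the resulting values fall in the range from 0 to 1 for non-negative input.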
166. ble by context menu, File menu → Print.

• Export reports
Reports created in PQStat can be exported to a file in rtf (supported by most text editors, such as Word), pdf or xml format. If the export is made in the Project Manager, the reports can be placed in separate files or in one joint file. To do this, select the adequate reports, then press the button, and export to a file or files with the selected format. Individual reports can be exported separately through the context menu in the report window.

• Describing reports
Reports can be described in the Project Manager or in the context menu of the report window, by adding a title or a note.

• Editing graphs
Editing a graph, with regard to its General and Detailed Options, is available in the context menu in the report window.

• Copying reports
By means of the clipboard you can also move the results of an analysis into other applications, for example Word or Excel.

• Deleting reports
You can delete a report by:
context menu → Delete report (Shift+Del) on the name of the report in the Navigation tree,
Project Manager.
However, you should remember that if there are any map layers added to a datasheet and you delete the datasheet, all layers attached to it will be deleted too.

The order of reports can be changed with the use of the context menu of the right mouse button: Up (Ctrl+Up) or Down
167. ce for the Intraclass Correlation Coefficient $r_{ICC}$

15.1 PARAMETRIC TESTS

15.1.1 The intraclass correlation coefficient and the test of its significance

The intraclass correlation coefficient is used when the measurement of variables is done by a few "judges" ($k \ge 2$). It measures the strength of interjudge reliability: the degree of concordance of their assessments. If the distribution of a variable is a normal distribution, it can be represented in a dependent model for the interval scale:

$$r_{ICC} = \frac{MS_{BS} - MS_{res}}{MS_{BS} + (k-1)\,MS_{res} + \frac{k}{n}\left(MS_{BC} - MS_{res}\right)}$$

where:
$MS_{BC}$: mean square between conditions (between judges), see ANOVA for dependent groups,
$MS_{BS}$: mean square between subjects,
$MS_{res}$: mean square residual,
$n$: sample size,
$k$: number of judges.

Note
$R_{ICC}$: the intraclass correlation coefficient in a population;
$r_{ICC}$: the intraclass correlation coefficient in a sample.

The value of $r_{ICC} \in \langle -1; 1 \rangle$, and it should be interpreted in the following way:
• $r_{ICC} \approx 1$: it is an absolute concordance of the objects' assessment made by judges; it is especially reflected in a high variance between objects (a significant means difference between $n$ objects) and a low variance between judges' assessments (a small means difference of assessments designated by $k$ judges);
• $r_{ICC} \approx$
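The coefficient above is built from the three mean squares of a two-way analysis of variance for dependent groups. The following is a minimal Python sketch that computes them from raw ratings, assuming the absolute-agreement form of the formula given above and rows as objects with columns as judges; the function name and the ratings are illustrative.

```python
def icc_absolute_agreement(ratings):
    """Intraclass correlation from a list of rows (objects) by columns (judges)."""
    n = len(ratings)        # number of objects
    k = len(ratings[0])     # number of judges
    grand_mean = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    column_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    ss_total = sum((x - grand_mean) ** 2 for row in ratings for x in row)
    ss_subjects = k * sum((m - grand_mean) ** 2 for m in row_means)
    ss_conditions = n * sum((m - grand_mean) ** 2 for m in column_means)
    ss_residual = ss_total - ss_subjects - ss_conditions

    ms_subjects = ss_subjects / (n - 1)                 # MS_BS
    ms_conditions = ss_conditions / (k - 1)             # MS_BC
    ms_residual = ss_residual / ((n - 1) * (k - 1))     # MS_res

    denominator = (ms_subjects + (k - 1) * ms_residual
                   + k / n * (ms_conditions - ms_residual))
    return (ms_subjects - ms_residual) / denominator


# Hypothetical ratings: 6 objects assessed by 4 judges.
ratings = [
    [9, 2, 5, 8],
    [6, 1, 3, 2],
    [8, 4, 6, 8],
    [7, 1, 2, 6],
    [10, 5, 6, 9],
    [6, 2, 4, 7],
]
print(round(icc_absolute_agreement(ratings), 3))
```

In the hypothetical data the judges differ strongly in their average level, which inflates $MS_{BC}$ and keeps the coefficient low even though the judges order the objects similarly.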
168. ce in time [0, t] between the observed number of failure events and their number predicted by the model. The expected value of the residuals is 0, but they have a diagonal distribution, which makes it more difficult to interpret the graph (they are in the range of −∞ to 1).

Deviance: similarly to martingale residuals, asymptotically they obtain the value 0, but they are distributed symmetrically around zero with a standard deviation equal to 1 when the model is appropriate. The deviance value is positive when the studied object survives for a shorter period of time than the one expected on the basis of the model, and negative when that period is longer. The analysis of those residuals is used in the study of the proportionality of the hazard, but it is mainly a tool for identifying outliers. In the residuals report, those which are further than 3 standard deviations away from 0 are marked in red.

Schoenfeld: the residuals are calculated separately for each independent variable and are only defined for complete observations. For each independent variable, the sum of the Schoenfeld residuals and their expected value is 0. An advantage of presenting the residuals with respect to time for each variable is the possibility of identifying a variable which does not fulfill, in the model, the assumption of hazard proportionality. That is the variable for which the graph of the residuals forms a systematic pattern (usually the studied area is the linear dependence of the res
169. cient $R^2_{Cox\text{-}Snell}$ never assumes the value 1 and is sensitive to the number of variables in the model, its corrected value is calculated:

$$R^2_{Nagelkerke} = \frac{1 - e^{-(2/n)(\ln L_{FM} - \ln L_0)}}{1 - e^{(2/n)\ln L_0}} \quad \text{or} \quad R^2_{Cox\text{-}Snell} = 1 - e^{\frac{(-2\ln L_{FM}) - (-2\ln L_0)}{n}}$$

• Statistical significance of all variables in the model
The basic tool for the evaluation of the significance of all variables in the model is the Likelihood Ratio test. The test verifies the hypothesis:
$H_0$: all $\beta_i = 0$,
$H_1$: there is $\beta_i \ne 0$.

The test statistic has the form presented below:

$$\chi^2 = -2\ln(L_0 / L_{FM}) = -2\ln(L_0) - (-2\ln(L_{FM}))$$

The statistic asymptotically (for large sizes) has the $\chi^2$ distribution with $k$ degrees of freedom. On the basis of the test statistic, the p value is estimated and then compared with $\alpha$:
if $p \le \alpha$ ⟹ we reject $H_0$ and accept $H_1$,
if $p > \alpha$ ⟹ there is no reason to reject $H_0$.

• Hosmer-Lemeshow test
The test compares, for various subgroups of data, the observed rates of occurrence of the distinguished value $O_g$ and the predicted probability $E_g$. If $O_g$ and $E_g$ are close enough, then one can assume that an adequate model has been built. For the calculation, the observations are first divided into $G$ subgroups, usually deciles ($G = 10$).

Hypotheses:
$H_0$: $O_g = E_g$ for all categories,
$H_1$: $O_g \ne E_g$ for at least one category.

The test statistic has the form presented below:
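The Likelihood Ratio statistic and the two pseudo-R² coefficients above all follow from two log-likelihoods: that of the model containing only the intercept and that of the full model. The following is a minimal Python sketch, assuming the natural-log likelihoods ln L0 and ln L_FM are already known; the function names and the example values are illustrative.

```python
import math


def likelihood_ratio_statistic(ln_l0, ln_lfm):
    """Chi-square statistic: -2 ln(L0) - (-2 ln(L_FM))."""
    return -2 * ln_l0 - (-2 * ln_lfm)


def r2_cox_snell(ln_l0, ln_lfm, n):
    """Cox-Snell pseudo R-squared for a sample of n observations."""
    return 1 - math.exp((2 / n) * (ln_l0 - ln_lfm))


def r2_nagelkerke(ln_l0, ln_lfm, n):
    """Cox-Snell value rescaled so that its maximum equals 1."""
    return r2_cox_snell(ln_l0, ln_lfm, n) / (1 - math.exp((2 / n) * ln_l0))


# Hypothetical log-likelihoods for 100 observations.
print(likelihood_ratio_statistic(-50.0, -40.0))
print(round(r2_cox_snell(-50.0, -40.0, 100), 4))
print(round(r2_nagelkerke(-50.0, -40.0, 100), 4))
```

The Nagelkerke value is always at least as large as the Cox-Snell value, because the divisor is the largest value the Cox-Snell coefficient can reach for the given data.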
170. contingency coefficient. The Yule's Q contingency coefficient (Yule, 1900) [88] is a measure of correlation which can be calculated for 2 x 2 contingency tables: $Q = \frac{O_{11}O_{22} - O_{12}O_{21}}{O_{11}O_{22} + O_{12}O_{21}}$, where $O_{11}, O_{12}, O_{21}, O_{22}$ are the observed frequencies in a contingency table. The Q coefficient value lies in the range <-1, 1>. The closer the value of Q is to 0, the weaker the dependence between the analysed features, and the closer it is to -1 or 1, the stronger the dependence. This coefficient has one disadvantage: it is not very resistant to small observed frequencies. If one of them is 0, the coefficient might wrongly indicate total dependence of the features. The statistical significance of the Yule's Q coefficient is determined by the Z test. Hypotheses: $H_0$: $Q = 0$; $H_1$: $Q \neq 0$. The test statistic is defined by: $Z = \frac{Q}{\sqrt{\frac{1}{4}(1 - Q^2)^2 \left(\frac{1}{O_{11}} + \frac{1}{O_{12}} + \frac{1}{O_{21}} + \frac{1}{O_{22}}\right)}}$. The test statistic asymptotically (for a large sample size) has the normal distribution. The p value designated on the basis of the test statistic is compared with the significance level $\alpha$: if $p \le \alpha$, reject $H_0$ and accept $H_1$; if $p > \alpha$, there is no reason to reject $H_0$. The $\phi$ contingency coefficient. The Phi contingency coefficient is a measure of correlation which can be calculated for 2 x 2 contingency tables: $\phi = \sqrt{\chi^2 / n}$, where $\chi^2$ is the value of the $\chi^2$ test statist
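A short sketch of the Yule's Q coefficient and its Z test may help make the definitions concrete. It assumes the usual large-sample standard error of Q, namely 0.5 * (1 - Q^2) * sqrt(1/O11 + 1/O12 + 1/O21 + 1/O22); the function names and the example table are illustrative only.

```python
import math

def yule_q(o11, o12, o21, o22):
    """Yule's Q for a 2 x 2 table of observed frequencies."""
    concordant = o11 * o22
    discordant = o12 * o21
    return (concordant - discordant) / (concordant + discordant)

def yule_q_z_test(o11, o12, o21, o22):
    """Return (Q, Z, two-sided asymptotic p value); all cells must be > 0."""
    q = yule_q(o11, o12, o21, o22)
    se = 0.5 * (1.0 - q ** 2) * math.sqrt(1 / o11 + 1 / o12 + 1 / o21 + 1 / o22)
    z = q / se
    p = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal tail probability
    return q, z, p
```

For the made-up table (10, 5, 5, 10) this gives Q = 0.6. A table with a zero cell, such as (10, 0, 5, 10), yields Q = 1 regardless of the other counts, which is the weakness noted in the text.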
171. correction for ties. For a large sample size: $Z = \frac{T - \frac{n(n+1)}{4}}{\sqrt{\frac{n(n+1)(2n+1)}{24} - \frac{\sum (t^3 - t)}{48}}}$, where $n$ is the number of ranked signs (the number of ranks) and $t$ is the number of cases included in a tie. The formula for the Z statistic includes the correction for ties. This correction is used when ties occur; if there are no ties, the correction is not calculated, because $\frac{\sum (t^3 - t)}{48} = 0$. The Z statistic asymptotically (for large sample sizes) has the normal distribution. The Wilcoxon test with the continuity correction (Marascuilo and McSweeney, 1977) [60]. The continuity correction is used to guarantee that the test statistic can take all real values, according to the assumption of the normal distribution. The test statistic with the continuity correction is defined by: $Z = \frac{\left|T - \frac{n(n+1)}{4}\right| - 0.5}{\sqrt{\frac{n(n+1)(2n+1)}{24} - \frac{\sum (t^3 - t)}{48}}}$. The settings window with the Wilcoxon test for dependent groups can be opened in Statistics menu → NonParametric tests (ordered categories) → Wilcoxon (matched pairs) or in Wizard. [Screenshot: settings window of the Wilcoxon test for dependent groups]
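The large-sample Z statistic with the correction for ties can be sketched as follows. This is an illustration under stated assumptions, not the program's code: T is taken here as the sum of ranks of the positive differences, zero differences are dropped before ranking, and all function names are hypothetical.

```python
import math
from collections import Counter

def average_ranks(values):
    """Ranks starting at 1; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def wilcoxon_z(first, second, continuity=False):
    """Return (T, Z) for paired measurements, with the tie correction."""
    diffs = [b - a for a, b in zip(first, second) if b != a]
    n = len(diffs)
    abs_diffs = [abs(d) for d in diffs]
    ranks = average_ranks(abs_diffs)
    t_stat = sum(r for r, d in zip(ranks, diffs) if d > 0)
    tie_term = sum(t ** 3 - t for t in Counter(abs_diffs).values())
    variance = n * (n + 1) * (2 * n + 1) / 24 - tie_term / 48
    numerator = t_stat - n * (n + 1) / 4
    if continuity:
        numerator = abs(numerator) - 0.5
    return t_stat, numerator / math.sqrt(variance)
```

With the made-up pairs first = [1, 2, 3, 4, 5] and second = [2, 4, 6, 8, 10], every difference is positive and untied, so T equals the full rank sum 15.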
172. criptive statistics. [Screenshot: Descriptive statistics settings window. Measures of central tendency: N, arithmetic mean, geometric mean, median, mode. Measures of variability: variance, standard deviation, confidence interval for the std. dev., std. err. of the mean, confidence interval for the mean, range, interquartile range. Distribution: skewness, std. err. of the skewness, kurtosis, std. err. of the kurtosis. Percentiles: minimum, maximum, lower quartile, upper quartile, percentile 10 and percentile 90. Report options: add analysed data.] Remember: when reducing a datasheet workspace using a data filter, filter conditions may be combined with a conjunction or an alternative. To switch between alternatives and conjunctions, use the corresponding buttons. A multiple filter uses one rule to divide data into several subgroups. The selected analysis is performed several times, separately for each subgroup. EXAMPLE 4.7. Multiple filter (filter.pqs file). You want to calculate descriptive statistics for girls' height and for boys' height separately. Choose Descriptive statistics from the Statistics menu. In the options window of Descriptive statistics choose the procedures you want to run, for example mean, standard deviation, minimum and maximum, and the variable to analyse (the column including height). Select multiple filter and add a rule using the bu
173. ctice it is assumed that with $n > 30$ the t-Student distribution may be approximated with the normal distribution. The settings window with the Single-sample t-test can be opened in Statistics menu → Parametric tests → t-test or in Wizard. [Screenshot: t-test settings window] Note: calculations can be based on raw data or on averaged data, i.e. the arithmetic mean, standard deviation and sample size. EXAMPLE 10.1 (courier.pqs file). You want to check if the waiting time for a delivery by some courier company is 3 days on average ($\mu_0 = 3$). To calculate it, 22 persons are chosen at random from all clients of the company as a sample. The number of days that passed from the moment the delivery was sent until it was delivered is then recorded for each of them. The values are as follows: 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 4, 4, 4, 4, 4, 5, 9, 0, 6, 6, 7, 7. The number of waiting days for the delivery in the analysed population fulfills the assumption of normality of distribution. Hypotheses: $H_0$: the mean number of waiting days for a delivery by the above-mentioned courier company is 3; $H_1$: the mean number of waiting days for
174. ction cost, advertising costs and the author's popularity. On the basis of the analyses above, from the perspective of statistics, the optimum model is the model with the 3 most important independent variables: the production cost, advertising costs and the author's popularity. However, the final decision on which model to choose should be made by a person with specialist knowledge of the studied topic, in this case the publisher. It ought to be remembered that the selected model should be constructed anew and its assumptions verified in the window Multiple regression. 17.4 LOGISTIC REGRESSION. The window with settings for Logistic Regression is accessed via the menu Statistics → Multidimensional Models → Logistic Regression. [Screenshot: Logistic Regression settings window with the variable list (GROUP, AddressOfRes, Sex, BirthWeight, MAge, PregNo, SponAbort, RespTInf, Smoking, MEdu, VocationalE, SecondaryE, TertiaryE) and the options Prediction, Classification, Cut-off point 0.5, Hosmer-Lemeshow test, ROC curves]
175. cus on the Rx variable (1 = placebo, 0 = treatment). We will place the log WBC variable in the model as a possible confounding factor which modifies the effect. In order to evaluate the possible interaction of Rx and log WBC we will also consider a third variable, a product of the interacting variables. We will add the variable to the model by selecting the Interactions button in the analysis window and by setting the appropriate options there (independent variable; interaction variables: log WBC, Rx). We build three Cox models.
Model A only contains the Rx variable:
Variable   b coeff     b error     -95% CI     +95% CI     Wald stat   p value     Hazard ratio   -95% CI     +95% CI
Rx         1.5091915   0.4095644   0.7064599   2.3119228   13.578263   0.0002255   4.5230719      2.0268035   10.093814
Model B contains the Rx variable and the potentially confounding variable log WBC:
Variable   b coeff     b error     -95% CI     +95% CI     Wald stat   p value     Hazard ratio   -95% CI     +95% CI
log WBC    1.6043432   0.3293283   0.9588716   2.2498148   23.732115   0.0000011   4.9745910      2.6087511   9.4859783
Rx         1.2940672   0.4221039   0.4667566   2.1213757   9.3986852   0.0021712   3.6475921      1.5948164   8.3426073
Model C contains the Rx variable, the log WBC variable and the potential effect of the interaction of those variables, Rx x log WBC:
Variable   b coeff     b error     -95% CI     +95% CI     Wald stat   p value     Hazard ratio   -95% CI     +95% CI
log WBC    1.8027
176. d a negative opinion before the exam. After the exam the percentage was 61.84%. Hypotheses: $H_0$: there is no difference between the number of negative evaluations of the professor before and after the exam; $H_1$: there is a difference between the number of negative evaluations of the professor before and after the exam. [Report: significance level 0.05; proportion 1 = 0.355263; proportion 2 = 0.618421; difference of the proportions = 0.263158; 95% CI for the difference of the proportions: 0.180717 to 0.338845; Z statistic = 5.773503; p value (asymptotic) < 0.000001] The difference in proportions found in the sample is 26.32%, and the 95% confidence interval for it (18.07%; 33.88%) does not contain 0. On the basis of the Z test (p < 0.0001), at the significance level of 0.05, we accept the alternative hypothesis, similarly to the case of McNemar's test. Therefore, the proportion of negative evaluations before the exam differs from the proportion of negative evaluations after the exam. Indeed, after the exam there are more negative evaluations of the professor. 12 COMPARISON - MORE THAN 2 GROUPS
177. d component represents chiefly the original variable sepal width; the remaining original variables are reflected in it only to a slight degree. The eigenvector, the factor loading and the contribution of the variable sepal width are the highest in the second component. Each principal component defines a homogeneous group of original values. We will call the first component "petal size", as its most important variables are those which carry the information about the petal, although it has to be noted that the length of the sepal also has a significant influence on the value of that component. When interpreting, we remember that the greater the values of that component, the smaller the petals. We will call the second component "sepal width", as only the width of the sepal is reflected to a greater degree here. The greater the values of that component, the narrower the sepal. Finally, we will generate the components by choosing, in the analysis window, the option Add Principal Components. A part of the obtained result is presented below. [Table: first rows of the generated principal component values]
178. d features. The V coefficient value also depends on the table size, so you should not use this coefficient to compare contingency tables of different sizes. The V contingency coefficient is considered statistically significant if the p value calculated on the basis of the $\chi^2$ test designated for this table is equal to or less than the significance level $\alpha$. The Pearson's C contingency coefficient. The Pearson's C contingency coefficient is a measure of correlation which can be calculated for r x c contingency tables: $C = \sqrt{\frac{\chi^2}{\chi^2 + n}}$, where $\chi^2$ is the value of the $\chi^2$ test statistic and $n$ is the total frequency in a contingency table. The C coefficient value lies in the range <0, 1). The closer the value of C is to 0, the weaker the dependence between the analysed features, and the farther it is from 0, the stronger the dependence. The C coefficient value also depends on the table size (the bigger the table, the closer to 1 the C value can be), which is why the top limit that the C coefficient may reach for the particular table size should be calculated: $C_{max} = \sqrt{\frac{w - 1}{w}}$, where $w$ is the smaller value out of $r$ and $c$. An inconvenient consequence of the dependence of the C value on the table size is that C coefficient values calculated for contingency tables of various sizes cannot be compared. A slightly better measure is a con
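The Pearson's C coefficient and its top limit for a given table size can be computed directly from the chi-square statistic. The sketch below is illustrative; the function names and the numbers in the usage note are made up.

```python
import math

def pearson_c(chi_square, n):
    """Pearson's C from the chi-square statistic and the total frequency n."""
    return math.sqrt(chi_square / (chi_square + n))

def pearson_c_max(rows, cols):
    """Top limit of C for an r x c table, where w = min(r, c)."""
    w = min(rows, cols)
    return math.sqrt((w - 1) / w)
```

For instance, a chi-square statistic of 10 with n = 90 gives C of about 0.316, while the top limit is about 0.707 for a 2 x 2 table and about 0.816 for a 3 x 5 table, which shows why raw C values from tables of different sizes are not comparable.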
179. d using different formulas. For a small sample size: $U = n_1 n_2 + \frac{n_1(n_1 + 1)}{2} - R_1$ or $U' = n_1 n_2 + \frac{n_2(n_2 + 1)}{2} - R_2$, where $n_1, n_2$ are the sample sizes and $R_1, R_2$ are the rank sums for the samples. This statistic has the Mann-Whitney distribution and it does not contain any correction for ties. The value of the exact probability of the Mann-Whitney distribution is calculated with accuracy up to the hundredth place of the fraction. For a large sample size: $Z = \frac{U - \frac{n_1 n_2}{2}}{\sqrt{\frac{n_1 n_2 (n_1 + n_2 + 1)}{12} - \frac{n_1 n_2 \sum (t^3 - t)}{12 (n_1 + n_2)(n_1 + n_2 - 1)}}}$, where $U$ can be replaced with $U'$ and $t$ is the number of cases included in a tie. The formula for the Z statistic includes the correction for ties. This correction is used when ties occur; if there are no ties, the correction is not calculated, because $\frac{n_1 n_2 \sum (t^3 - t)}{12 (n_1 + n_2)(n_1 + n_2 - 1)} = 0$. The Z statistic asymptotically (for large sample sizes) has the normal distribution. The Mann-Whitney test with the continuity correction (Marascuilo and McSweeney, 1977) [60]. The continuity correction should be used to guarantee that the test statistic can take all real values, according to the assumption of the normal distribution. The formula for the test statistic with the continuity correction is defined as: $Z = \frac{\left|U - \frac{n_1 n_2}{2}\right| - 0.5}{\sqrt{\frac{n_1 n_2 (n_1 + n_2 + 1)}{12} - \frac{n_1 n_2 \sum (t^3 - t)}{12 (n_1 + n_2)(n_1 + n_2 - 1)}}}$
180. day Thursday Friday. The p value from the $\chi^2$ distribution with 4 degrees of freedom comes to 0.287297. So, using the significance level $\alpha = 0.05$, you can conclude that there is no reason to reject the null hypothesis, which states that the number of served dinners is compatible with the expected number of dinners served on the particular days. Note: if you want to make more comparisons within one study, it is possible to use the Bonferroni correction [1]. The correction is used to limit the size of the type I error when we compare the observed frequencies and the expected ones between particular days, for example: Friday ⟷ Monday, Friday ⟷ Tuesday, Friday ⟷ Wednesday, Friday ⟷ Thursday, provided that the comparisons are made independently. The significance level $\alpha = 0.05$ for each comparison must be calculated according to this correction using the following formula: $\alpha = \frac{0.05}{r}$, where $r$ is the number of executed comparisons. The significance level for each comparison according to the Bonferroni correction in this example is $\alpha = \frac{0.05}{4} = 0.0125$. However, it is necessary to remember that if you reduce $\alpha$ for each comparison, the power of the test is decreased. 10.2.4 Tests for proportion. You should use tests for proportion if there are two possible results to obtain (one of them is a
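The Bonferroni adjustment from the note can be written as a small helper. This is only a sketch; the function names and the p values in the usage note are invented for illustration and do not come from the example.

```python
def bonferroni_alpha(alpha, comparisons):
    """Per-comparison significance level: alpha divided by the number of comparisons."""
    if comparisons < 1:
        raise ValueError("at least one comparison is required")
    return alpha / comparisons

def flag_significant(p_values, alpha=0.05):
    """Map each comparison name to True when its p value is <= the corrected level."""
    level = bonferroni_alpha(alpha, len(p_values))
    return {name: p <= level for name, p in p_values.items()}
```

With four comparisons and alpha = 0.05 the per-comparison level is 0.0125, so a hypothetical p value of 0.03 that would pass at 0.05 is no longer flagged, which illustrates the loss of power mentioned above.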
181. del. Too large a model involves a plethora of information in which the important pieces may get lost. Too small a model involves the risk of omitting those features which could describe the studied phenomenon in a reliable manner. It is not the number of variables in the model but their quality that determines the quality of the model. To make a proper selection of independent variables it is necessary to have knowledge and experience connected with the studied phenomenon. One has to remember to put into the model variables strongly correlated with the dependent variable and weakly correlated with one another. There is no single, simple statistical rule which would decide about the number of variables necessary in the model. The measures of model adequacy most frequently used in a comparison are: $R^2_{adj}$, the corrected value of the multiple determination coefficient (the higher the value, the more adequate the model), and $SE_e$, the standard error of estimation (the lower the value, the more adequate the model). For that purpose the F test based on the multiple determination coefficient $R^2$ can also be used. The test is used to verify the hypothesis that the adequacy of both compared models is equally good. Hypotheses: $H_0$: $R^2_{FM} = R^2_{RM}$; $H_1$: $R^2_{FM} \neq R^2_{RM}$, where $R^2_{FM}, R^2_{RM}$ are the multiple determination coefficients in the compared models (full and reduced). The test statistic has t
182. del on the basis of the results of the comparison is made by the researcher. Automatic model comparison is done in several steps. Step 1: constructing the model with the use of all variables. Step 2: removing one variable from the model; the removed variable is the one which, from the statistical point of view, contributes the least information to the current model. Step 3: a comparison of the full and the reduced model. Step 4: removing another variable from the model; the removed variable is the one which, from the statistical point of view, contributes the least information to the current model. Step 5: a comparison of the previous and the newly reduced model. In that way numerous, ever smaller models are created. The last model only contains 1 independent variable. EXAMPLE 17.3 (continued), task.pqs file. In the experiment made with the purpose of studying concentration abilities, a logistic regression model was constructed on the basis of the following variables. Dependent variable: SOLUTION (yes/no), information about whether the task was correctly solved or not. Independent variables: ADDRESSOFRES (1 = city, 0 = village), SEX (1 = female, 0 = male), AGE (in years), EDUCATION (1 = primary, 2 = vocational, 3 = secondary, 4 = tertiary), TIME needed for the completion of the task (in minutes), DISTURBANCES (1 = yes, 0 = no). Let us check if all indepen
183. dent variables are indispensable in the model. Manual model comparison: on the basis of the previously constructed full model we can suspect that the variables ADDRESSOFRES and SEX have little influence on the constructed model, i.e. we cannot successfully make classifications on the basis of those variables. Let us check if, from the statistical point of view, the full model is better than the model from which the two variables have been removed. [Report: significance level 0.05. Model 1 (analysed variables: ADDRESSOFRES, SEX, AGE, EDUCATION, TIME, DISTURBANCES): number of variables in the model 7; -2 Log Likelihood 128.7082434; Pseudo R2 0.267713; R2 Nagelkerke 0.409674; R2 Cox-Snell 0.303664. Model 2 (analysed variables: AGE, EDUCATION, TIME, DISTURBANCES): number of variables in the model 5; -2 Log Likelihood 131.0824474; Pseudo R2 0.254206; R2 Nagelkerke 0.392363; R2 Cox-Snell 0.290851. Comparing models: Chi-square statistic, degrees of freedom, p value. Model 1, rows Intercept, ADDRESSOFRES, SEX, AGE, EDUCATION, TIME, DISTURBANCES: b coeff 7.230601, 0.453242, 0.454756, 0.100896, 0.455926, 0.089395, 1.924; b error 1.870134, 0.450524, 0.451304, 0.03159, 0.241805, 0.027609, 0.475056. Model 2, rows Intercept, AGE, EDUCATION, TIME, DISTURBANCES: b coeff 6.762101, 0.106345, 0.50406, 0.083768, 1.847732; b error 1.821666, 0.051611, 0.23731, 0.026838, 0.46198. 95% CI: 3.565206, 1.336253, 1.339327, 0.162812, 0.018, 0.143
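The comparison of the full and the reduced model can be reproduced from the two -2 Log Likelihood values shown in the report. The sketch below assumes the comparison is a likelihood ratio chi-square test whose degrees of freedom equal the number of removed variables (2 here); the closed-form tail probability used is valid only for an even number of degrees of freedom, and the function names are illustrative.

```python
import math

def chi2_sf_even_df(x, df):
    """Upper tail probability of the chi-square distribution for even df."""
    if df < 2 or df % 2 != 0:
        raise ValueError("this closed form needs an even df >= 2")
    half = x / 2.0
    total = sum(half ** i / math.factorial(i) for i in range(df // 2))
    return math.exp(-half) * total

def compare_nested_models(m2ll_full, m2ll_reduced, removed_variables):
    """Return (chi-square statistic, p value) for two nested models."""
    statistic = m2ll_reduced - m2ll_full
    return statistic, chi2_sf_even_df(statistic, removed_variables)

# -2 Log Likelihood values taken from the report above
statistic, p_value = compare_nested_models(128.7082434, 131.0824474, 2)
```

Under these assumptions the statistic is about 2.374 and the p value about 0.305, which would mean that removing ADDRESSOFRES and SEX does not significantly worsen the model.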
184. $df_{WG}$ degrees of freedom. The Tukey test. For simple comparisons (equal-size groups as well as unequal-size groups). (i) The value of the critical difference is calculated by using the following formula: $CD = \frac{\sqrt{2} \cdot q_{\alpha, df_{WG}, k} \cdot \sqrt{MS_{WG}\left(\frac{1}{n_i} + \frac{1}{n_j}\right)}}{2}$, where $q_{\alpha, df_{WG}, k}$ is the critical value (statistic) of the studentized range distribution for a given significance level $\alpha$ and $df_{WG}$ and $k$ degrees of freedom. (ii) The test statistic has the studentized range distribution with $df_{WG}$ and $k$ degrees of freedom. Info: the algorithm for calculating the p value and the statistic of the studentized range distribution in PQStat is based on the works of Lund (1983) [54]. Other applications or web pages may calculate slightly different values than PQStat, because they may be based on less precise or more restrictive algorithms (Copenhaver and Holland, 1988; Gleason, 1999). The settings window with the One-way ANOVA for independent groups can be opened in Statistics menu → Parametric tests → ANOVA for independent groups or in Wizard. [Screenshot: ANOVA for independent groups settings window with the options Use the grouping variable and POST-HOC: Tukey HSD, Fisher LSD, Scheffe]
185. died objects into group (+) and group (-). [Figure: ROC curve; vertical axis Sensitivity, horizontal axis 1 - Specificity] AUC (area under curve): the size of the area under the ROC curve falls within <0, 1>. The greater the area, the more exact the classification of the objects into group (+) and group (-) on the basis of the analyzed diagnostic variable, and therefore the more useful that diagnostic variable is as a classifier. The area AUC, the error $SE_{AUC}$ and the confidence interval for AUC are calculated on the basis of: the nonparametric DeLong method (DeLong E.R. et al., 1988 [26]; Hanley J.A. and Hajian-Tilaki K.O., 1997 [38]), which is recommended; the nonparametric Hanley-McNeil method (Hanley J.A. and McNeil M.D., 1982 [39]); the Hanley-McNeil method which presumes a double negative exponential distribution (Hanley J.A. and McNeil M.D., 1982 [39]), computed only when groups (+) and (-) are equinumerous. For the classification to be better than a random distribution of objects into two classes, the area under the ROC curve should be significantly larger than the area under the line $y = x$, i.e. than 0.5. Hypotheses: $H_0$: $AUC = 0.5$; $H_1$: $AUC \neq 0.5$. The test statistic has the form presented below: $Z = \frac{AUC - 0.5}{SE_{0.5}}$, where $SE_{0.5} = \sqrt{\frac{n_+ + n_- + 1}{12 n_+ n_-}}$, $n_+$ is the size of the sample in which the given phenomenon occurs, $n_-$ is the size of the sample in which the given phen
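The Z test of the hypothesis AUC = 0.5 can be sketched as below. It assumes the null-hypothesis standard error sqrt((n+ + n- + 1) / (12 * n+ * n-)), which is the usual rank-based form; the function name and the example numbers are hypothetical.

```python
import math

def auc_z_test(auc, n_pos, n_neg):
    """Return (Z, two-sided p value) for H0: AUC = 0.5.

    n_pos: size of the sample in which the phenomenon occurs
    n_neg: size of the sample in which it does not occur
    """
    se_null = math.sqrt((n_pos + n_neg + 1) / (12.0 * n_pos * n_neg))
    z = (auc - 0.5) / se_null
    p = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal tail probability
    return z, p
```

For a made-up AUC of 0.75 with 20 objects in each group, Z is about 2.71, so the classification would be judged better than random at the 0.05 level.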
186. dies (occur / not occur). [Tables: observed frequencies for women and for men; rows: place of residence (rural, urban); columns: leptospirosis antibodies (occur, not occur)] Gender is associated with both factors: the occurrence of leptospirosis antibodies and the place of residence in West India. This is a significant factor. Ignoring it can lead to errors in the results. [Report: analysed variables: contingency table; significance level 0.05. Strata 1: Odds Ratio 2.571429; 95% CI for the Odds Ratio 1.237622 to 5.342701; statistic for the Odds Ratio 2.531365; p value 0.011362. Strata 2: Odds Ratio 1.714286; 95% CI for the Odds Ratio 0.781346 to 3.76117; statistic for the Odds Ratio 1.344493; p value 0.178789. Odds Ratio MH 2.126374; 95% CI for the Odds Ratio MH 1.244338 to 3.63363; degrees of freedom 1; statistic for the Odds Ratio MH 7.81939; p value 0.005169. Homogeneity of the Odds Ratio: degrees of freedom 1; 0.546611] The odds of the occurrence of leptospirosis antibodies are larger among village inhabitants, both among women OR 95 C
187. difference that matrices $U$ and $V$ are replaced with the sums of matrices $\sum_{l=1}^{L} U$ and $\sum_{l=1}^{L} V$. The summation is made over the strata created by the variables with respect to which we adjust the analysis, $l = 1, 2, ..., L$. The statistic asymptotically (for large sizes) has the $\chi^2$ distribution with 1 degree of freedom. On the basis of the test statistic the p value is estimated and then compared with the significance level $\alpha$: if $p \le \alpha$, we reject $H_0$ and accept $H_1$; if $p > \alpha$, there is no reason to reject $H_0$. EXAMPLE 19.1 (continued), transplant.pqs file. The differences for two survival curves. Liver transplantations were made in two hospitals. We will check if the patients' survival time after transplantation depended on the hospital in which the transplantation was made. The comparison of the survival curves for those hospitals will be made on the basis of all tests proposed in the program for such a comparison. Hypotheses: $H_0$: the survival curve of the patients of hospital no. 1 = the survival curve of the patients of hospital no. 2; $H_1$: the survival curve of the patients of hospital no. 1 $\neq$ the survival curve of the patients of hospital no. 2. [Report: analysed variables: time, status; significance level 0.05; grouping variable: hospital (hospital 1, hospital 2); failure events; censored. Test: LogRank; Chi-square statistic 0.2743551308; Degre
188. ding to the following formula: $\alpha_C = \frac{k}{k - 1}\left(1 - \frac{\sum_{i=1}^{k} sd_i^2}{sd_t^2}\right)$, where $k$ is the number of scale items, $sd_i^2$ is the variance of the $i$-th item and $sd_t^2$ is the variance of the sum of the items. The standardised reliability coefficient $\alpha_{standard}$ is calculated according to the following formula: $\alpha_{standard} = \frac{k \bar{r}_p}{1 + (k - 1)\bar{r}_p}$, where $\bar{r}_p$ is the mean of all the Pearson's correlation coefficients for the $k(k-1)/2$ pairs of scale items. Alpha can take on any value less than or equal to 1, including negative values, although only positive values make sense. If all scale items are reliable, the reliability coefficient is 1. There are some values that help in the assessment of the usefulness of particular scale items: the value of the $\alpha$ coefficient calculated after removing a particular scale item; the value of the standard deviation of the scale calculated after removing a particular scale item; the mean value of the scale calculated after removing a particular scale item; the Pearson's correlation coefficient between a particular item and the sum of the other items. Split-half reliability. Split-half reliability is a random division of the scale items into 2 halves and an analysis of the correlation of the halves. It is carried out with the Spearman-Brown split-half reliability coefficient, published independently by Spearman (1910) [75] and Brown (1910) [17]: $r_{SH} = \frac{2 r_p}{1 + r_p}$, where $r_p$ is the Pearson's correlation coefficient between the halves of a scale.
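The reliability coefficients above translate directly into code. This sketch uses sample variances (n - 1 in the denominator), which is an assumption about the variance convention; the function names and the toy data in the usage note are illustrative.

```python
import statistics

def cronbach_alpha(items):
    """Cronbach's alpha; `items` is a list of columns, one list of scores per scale item."""
    k = len(items)
    sum_item_variances = sum(statistics.variance(column) for column in items)
    totals = [sum(row) for row in zip(*items)]  # per-respondent sum of items
    return k / (k - 1) * (1.0 - sum_item_variances / statistics.variance(totals))

def standardized_alpha(k, mean_r):
    """Standardised alpha from the number of items and the mean inter-item correlation."""
    return k * mean_r / (1.0 + (k - 1) * mean_r)

def spearman_brown(r_halves):
    """Split-half reliability from the correlation between the two halves."""
    return 2.0 * r_halves / (1.0 + r_halves)
```

Two identical toy items, such as [1, 2, 3, 4] and [1, 2, 3, 4], give alpha = 1, matching the statement that fully reliable items yield a coefficient of 1.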
189. 19.3.1 Differences among the survival curves. Hypotheses: $H_0$: $S_1(t) = S_2(t) = ... = S_k(t)$ for all $t$; $H_1$: not all $S_i(t)$ are equal. In the calculations the chi-square statistic of the following form is used: $\chi^2 = U'V^{-1}U$, where $U_i = \sum_{j=1}^{m} w_j (d_{ij} - e_{ij})$ and $V$ is the covariance matrix of dimensions $(k-1) \times (k-1)$, with the diagonal elements $\sum_{j=1}^{m} w_j^2 \frac{n_{ij}(n_j - n_{ij}) d_j (n_j - d_j)}{n_j^2 (n_j - 1)}$ and the off-diagonal elements $-\sum_{j=1}^{m} w_j^2 \frac{n_{ij} n_{lj} d_j (n_j - d_j)}{n_j^2 (n_j - 1)}$; $m$ is the number of moments in time with a failure event (death); $d_j = \sum_{i=1}^{k} d_{ij}$ is the observed number of failure events (deaths) in the $j$-th moment of time; $d_{ij}$ is the observed number of failure events (deaths) in the $i$-th group in the $j$-th moment of time; $e_{ij} = \frac{n_{ij} d_j}{n_j}$ is the expected number of failure events (deaths) in the $i$-th group in the $j$-th moment of time; $n_j = \sum_{i=1}^{k} n_{ij}$ is the number of cases at risk in the $j$-th moment of time. The statistic asymptotically (for large sizes) has the $\chi^2$ distribution with $df = k - 1$ degrees of freedom. On the basis of the test statistic the p value is estimated and then compared with the significance level $\alpha$: if $p \le \alpha$, we reject $H_0$ and accept $H_1$; if $p > \alpha$, there is no reason to reject $H_0$. Hazard ratio. In the log-rank test the observed values of failure events (deaths) $O_i = \sum_{j=1}^{m} d_{ij}$ and the appropriate expected values $E_i = \sum_{j=1}^{m} e_{ij}$ are given. The measure for describing the size of the difference between a pair of survival curves is
190. e you should fill in the upper half as well as the lower half of the window. The upper half should be filled in exactly the same way as for data searching. In the lower half of the window you should insert the data which are supposed to replace the found ones. Then you should click Find and Replace, or Find and Replace All if you want to replace all occurrences of the found data. Both searching and replacing data are accompanied by a direct preview of the current action on the sheet. 3.1.6 HOW TO SORT DATA. The options for sorting data can be found after choosing Sort from the Data menu, or the Sort option in the context menu of the number displayed above a column header. Usually the whole datasheet is sorted (this is the default setting), but if you first select a part of the data, then in the sorting window you will have the opportunity to reduce the sorted area to just this selected part. [Screenshot: Data sort window with the options Sorting only in the selected area and Sorting in the sheet, and the boxes Choose variables and Sequence] In the sorting window you can use the indicators to move, from the Choose variables box to the Sequence box, those variables according to which you want to sort the data. Then you should choose the Sort order and confirm your choice by clicking Run. You can choose a maximum of 3 columns as sorting criteria. If you sort data using more than one
191. The constructed model of logistic regression, similarly to the case of multiple linear regression, allows the study of the effect of many independent variables ($X_1, X_2, ..., X_k$) on one dependent variable ($Y$). This time, however, the dependent variable only assumes two values, e.g. ill/healthy, insolvent/solvent, etc. The two values are coded as (1)/(0), where (1) is the distinguished value (possessing the feature) and (0) means not possessing the feature. The function on which the model of logistic regression is based does not calculate the 2-level variable $Y$ but the probability of that variable assuming the distinguished value: $P(Y = 1 | X_1, X_2, ..., X_k) = \frac{e^Z}{1 + e^Z}$, where $P(Y = 1 | X_1, X_2, ..., X_k)$ is the probability of assuming the distinguished value (1) on condition that specific values of the independent variables are achieved, the so-called probability predicted for 1. $Z$ is most often expressed in the form of a linear relationship: $Z = \beta_0 + \sum_{i=1}^{k} \beta_i X_i$, where $X_1, X_2, ..., X_k$ are the independent (explanatory) variables and $\beta_0, \beta_1, \beta_2, ..., \beta_k$ are the parameters. Note: function $Z$ can also be described with the use of a higher-order relationship, e.g. a square relationship. In such a case we introduce into the model a variable containing the square of the independent variable X
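The predicted probability for the distinguished value can be computed from the linear predictor Z as in the sketch below. The function name and the coefficients in the usage note are hypothetical; the expression 1 / (1 + e^(-Z)) is algebraically equal to e^Z / (1 + e^Z).

```python
import math

def predicted_probability(intercept, coefficients, values):
    """P(Y = 1 | X1..Xk) for a logistic model with a linear Z."""
    z = intercept + sum(b * x for b, x in zip(coefficients, values))
    return 1.0 / (1.0 + math.exp(-z))  # same as e^Z / (1 + e^Z)
```

For example, with a made-up intercept of -1 and a single coefficient of 0.5, a case with X = 2 has Z = 0 and therefore a predicted probability of exactly 0.5, which is also the default cut-off point for classification.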
192. e is used to verify the hypothesis that the mean of the differences in the analysed population is 0. Basic assumptions: measurement on an interval scale; normality of distribution of the differences $d_i$ (or the normal distribution of the analysed variable in each measurement); a dependent model. Hypotheses: $H_0$: $\mu_0 = 0$; $H_1$: $\mu_0 \neq 0$, where $\mu_0$ is the mean of the differences $d_i$ in a population. The test statistic is defined by: $t = \frac{\bar{d}}{sd_d}\sqrt{n}$, where $\bar{d}$ is the mean of the differences $d_i$ in a sample, $sd_d$ is the standard deviation of the differences $d_i$ in a sample and $n$ is the number of differences $d_i$ in a sample. The test statistic has the t-Student distribution with $n - 1$ degrees of freedom. The p value designated on the basis of the test statistic is compared with the significance level $\alpha$: if $p \le \alpha$, reject $H_0$ and accept $H_1$; if $p > \alpha$, there is no reason to reject $H_0$. Note: the standard deviation of the differences is defined by $sd_d = \sqrt{\frac{\sum_{i=1}^{n} (d_i - \bar{d})^2}{n - 1}}$; the standard error of the mean of the differences is defined by $SEM_d = \frac{sd_d}{\sqrt{n}}$. The settings window with the t-test for dependent groups can be opened in Statistics menu → Parametric tests → t-test for dependent groups or in Wizard. [Screenshot: t-test for dependent groups settings window]
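The t statistic for dependent groups follows directly from the mean and standard deviation of the differences. The sketch below returns the statistic and the degrees of freedom only (no p value, since that would need the t-Student distribution function); the function name and the sample values are made up.

```python
import math
import statistics

def paired_t_statistic(first, second):
    """Return (t, degrees of freedom) for paired measurements."""
    diffs = [a - b for a, b in zip(first, second)]
    n = len(diffs)
    mean_d = statistics.mean(diffs)
    sd_d = statistics.stdev(diffs)  # sample standard deviation, n - 1 denominator
    t = mean_d / sd_d * math.sqrt(n)
    return t, n - 1
```

With the illustrative measurements [5, 7, 9, 6] and [3, 4, 8, 5], the differences are 2, 3, 1, 1, their mean is 1.75 and the statistic is compared with the t-Student distribution with 3 degrees of freedom.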
193. 14.2 NONPARAMETRIC TESTS. 14.2.1 THE MONOTONIC CORRELATION COEFFICIENTS. The monotonic correlation may be described as monotonically increasing or monotonically decreasing. The relation between 2 features is monotonically increasing if an increase of one feature is accompanied by an increase of the other one. The relation between 2 features is monotonically decreasing if an increase of one feature is accompanied by a decrease of the other one. The Spearman's rank-order correlation coefficient $r_s$ is used to describe the strength of monotonic relations between 2 features, $X$ and $Y$. It may be calculated on an ordinal scale or an interval one. The value of the Spearman's rank correlation coefficient should be calculated using the following formula: $r_s = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n(n^2 - 1)}$, where $d_i = R_{x_i} - R_{y_i}$ is the difference of ranks for the features $X$ and $Y$ and $n$ is the number of $d_i$. This formula is modified when there are ties: $r_s = \frac{\Sigma_X + \Sigma_Y - \sum_{i=1}^{n} d_i^2}{2\sqrt{\Sigma_X \Sigma_Y}}$, where $\Sigma_X = \frac{n^3 - n - T_X}{12}$, $\Sigma_Y = \frac{n^3 - n - T_Y}{12}$, $T_X = \sum (t^3 - t)$ for the ties in $X$, $T_Y = \sum (t^3 - t)$ for the ties in $Y$, and $t$ is the number of cases included in a tie. This correction is used when ties occur. If there are no ties, the correction is not calculated, because the formula reduces to the equation above. Note: $R_s$, the Spearman's rank c
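The basic (no ties) form of the Spearman coefficient can be illustrated with a short function. This sketch deliberately refuses tied values rather than applying the tie-corrected formula; the function name and sample data are hypothetical.

```python
def spearman_no_ties(x, y):
    """Spearman's rank correlation via 1 - 6*sum(d^2) / (n*(n^2 - 1)); no ties allowed."""
    if len(set(x)) != len(x) or len(set(y)) != len(y):
        raise ValueError("tied values need the tie-corrected formula")

    def ranks(values):
        order = sorted(range(len(values)), key=lambda i: values[i])
        result = [0] * len(values)
        for rank, index in enumerate(order, start=1):
            result[index] = rank
        return result

    rx, ry = ranks(x), ranks(y)
    n = len(x)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1.0 - 6.0 * sum_d2 / (n * (n ** 2 - 1))
```

A perfectly increasing relation gives 1, a perfectly decreasing one gives -1, and the made-up pairs x = [1, 2, 3, 4, 5], y = [2, 1, 4, 3, 5] give 0.8.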
194. e report.

Variable: Number of microorganisms

Interval    Frequency   Percent   Cumulative percent
0-20        1           1.852     1.852
20-40       3           5.556     7.407
40-60       ...         11.111    18.519
60-80       ...         38.889    57.407
80-100      ...         29.63     87.037
100-120     ...         7.407     94.444
120-140     3           5.556     100

7 DESCRIPTIVE STATISTICS

We use descriptive statistics to describe the main features of a collection of data (for example the mean value, median or standard deviation) and to draw some basic conclusions and generalisations about the collection of data.

To calculate descriptive statistics for data gathered in a sheet, open the Descriptive statistics window, which is in Statistics menu → Descriptive statistics.

[Settings window "Descriptive statistics" for the variables Number of microorganisms and Fertiliser; options include: arithmetic mean, geometric mean, harmonic mean, median, mode, variance, standard deviation, confidence interval for the std. dev., coefficient of variability, std. err. of the mean, confidence interval for the mean, range, interquartile range, percentiles, skewness, std. er o
195. e sample. Therefore, for the model to be well fitting, the standard error of estimation $SE_e$ (expressed as the variance of $e$) should be as small as possible.

Multiple correlation coefficient $R = \sqrt{R^2} \in \langle 0; 1\rangle$ defines the strength of the effect of the set of variables $X_1, X_2, \dots, X_k$ on the dependent variable $Y$.

Multiple determination coefficient $R^2$ is the measure of model adequacy. The value of that coefficient falls within the range of ⟨0; 1⟩, where 1 means excellent model adequacy and 0 a complete lack of adequacy. The estimation is made using the following formula:
$$T_{SS} = E_{SS} + R_{SS},$$
where $T_{SS}$ is the total sum of squares, $E_{SS}$ is the sum of squares explained by the model, and $R_{SS}$ is the residual sum of squares.

The coefficient of determination is estimated from the formula
$$R^2 = \frac{E_{SS}}{T_{SS}}.$$
It expresses the percentage of the variability of the dependent variable explained by the model.

As the value of the coefficient $R^2$ depends on model adequacy but is also influenced by the number of variables in the model and by the sample size, there are situations in which it can be encumbered with a certain error. That is why a corrected value of that parameter is estimated:
$$R^2_{adj} = R^2 - \frac{k(1-R^2)}{n-k-1}.$$

Statistical significance of all variables in the model
The basic tool for the evaluation of the significance of all variables in the model is the analysis of variance test (the F-test). The test simultaneously verifie
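The two coefficients above are easy to reproduce numerically. The sketch below is a minimal illustration, assuming the predicted values come from a least-squares model with an intercept (so that the total sum of squares splits into the explained and residual parts); the function names and the numbers are made up for the example.

```python
def r_squared(y, y_hat):
    """R^2 = E_SS / T_SS, with E_SS obtained as T_SS - R_SS."""
    y_bar = sum(y) / len(y)
    tss = sum((v - y_bar) ** 2 for v in y)             # total sum of squares
    rss = sum((v - p) ** 2 for v, p in zip(y, y_hat))  # residual sum of squares
    ess = tss - rss                                    # explained sum of squares
    return ess / tss


def adjusted_r_squared(r2, n, k):
    """Corrected value: R^2 - k * (1 - R^2) / (n - k - 1)."""
    return r2 - k * (1 - r2) / (n - k - 1)


# Illustrative observed and predicted values.
y = [1.0, 2.0, 3.0, 4.0]
y_hat = [1.1, 1.9, 3.2, 3.8]
r2 = r_squared(y, y_hat)

# Illustrative model: R^2 = 0.8, n = 20 observations, k = 3 variables.
r2_adj = adjusted_r_squared(0.8, n=20, k=3)
print(round(r2, 4), round(r2_adj, 4))
```

The correction grows with the number of variables k and shrinks as the sample size n increases, which is why it penalises small samples with many predictors.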
196. e start. Obviously, it would be preferable to gain access to the real value and enter it in place of the missing data, but that is not always possible.

The manner in which missing data are treated depends primarily on their character. A number of ways of imputing missing data for particular variables have been implemented in the program. The window with the settings for replacing missing data is accessed from the menu Data → Missing data.

[Settings window "Missing data substitution"; options: Given value; Value of mean/median/mode; Random values; Values from formulas; Values predicted from regression; Time series: interpolation based on previous values; Time series: mean of n neighbors; Time series: median of n neighbors]

1. Filling in with one value
Selecting one of the options below replaces all the missing data in the selected column with the same value:
• the value given by the user,
• the arithmetic mean calculated from the data,
• the geometric mean calculated from the data,
• the harmonic mean calculated from the data,
• the median,
• the mode (unless it is multiple).

2. Filling in with many values
Selecting one of the options below replaces the missing data in the selected column with many, usual
197. e variable which informs about the order of data; the interpolation consists in determining the mean of the values of the n preceding neighbors and the n neighbors directly following the missing data,
• the median from n neighbors: it applies to time series, so the user must point to the time variable which informs about the order of data; the interpolation consists in determining the median of the values of the n preceding neighbors and the n neighbors directly following the missing data.

Note: In order to distinguish the imputed data from the real data, the replaced data are marked with a selected color.

EXAMPLE 3.1 (file: missingData publisher.pqs)
The analysis of the file wydawca.pqs, which does not contain missing data, was discussed in the chapter Multiple linear regression. This time we will discuss a datasheet in which there are missing data in the column containing the gross profit from a sale of books. For those missing data we know the real values (datasheet REAL VALUES), so we can compare the values generated by the program in place of the missing data with the real values, and compare the results obtained with various techniques. In the example we will use 2 methods of replacing missing data: replacing them with the value of the median and replacing them with a value determined on the basis of a regression model. The remaining possibilities can be studied independently.

Replacing the missing data with the value of
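The neighbor-based interpolation for time series described above (the mean or median of the n preceding and n following values) can be sketched in Python as follows. The function `impute_from_neighbors` is a hypothetical helper, not PQStat code; how the program treats neighbors that are themselves missing, or gaps near the ends of the series, is not stated in the text, so skipping them is an assumption of this sketch.

```python
from statistics import mean, median


def impute_from_neighbors(series, n, agg=mean):
    """Fill None entries using up to n known neighbors on each side.

    `series` must already be ordered by the time variable.
    Assumption: neighbors that are missing themselves are skipped.
    """
    result = list(series)
    for i, value in enumerate(series):
        if value is None:
            before = [v for v in series[max(0, i - n):i] if v is not None]
            after = [v for v in series[i + 1:i + 1 + n] if v is not None]
            neighbors = before + after
            if neighbors:                      # leave the gap if nothing is known
                result[i] = agg(neighbors)
    return result


# Illustrative series with one missing value in the middle.
profit = [10, 12, None, 16, 20]
by_mean = impute_from_neighbors(profit, n=1)                 # mean of 12 and 16
by_median = impute_from_neighbors(profit, n=2, agg=median)   # median of 10, 12, 16, 20
print(by_mean[2], by_median[2])
```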
198. eaker dependence joins the analysed features.

Basic assumptions:
• measurement on the interval scale,
• normality of the distribution of the analysed features in the population.

Hypotheses:
$H_0: R_p = 0$,
$H_1: R_p \neq 0$.

The test statistic is defined by
$$t = \frac{r_p}{SE}, \quad\text{where } SE = \sqrt{\frac{1 - r_p^2}{n-2}}.$$
The value of the test statistic cannot be calculated when $r_p = 1$ or $r_p = -1$, or when $n < 3$.

The test statistic has the t-Student distribution with $n-2$ degrees of freedom. The p-value designated on the basis of the test statistic is compared with the significance level $\alpha$:
if p ≤ α ⟹ reject $H_0$ and accept $H_1$,
if p > α ⟹ there is no reason to reject $H_0$.

14.1.3 The test of significance for the coefficient of linear regression equation

This test is used to verify the hypothesis determining the lack of a linear dependence between the analysed features and is based on the slope coefficient (also called an effect) calculated for the sample. The closer the value of $\beta$ is to 0, the weaker the dependence presented by the fitted line.

Basic assumptions:
• measurement on the interval scale,
• normality of the distribution of the analysed features in the population.

Hypotheses:
$H_0: \beta = 0$,
$H_1: \beta \neq 0$.

The test statistic is defined by
$$t = \frac{\beta}{SE},$$
where
$$SE = \frac{s_{yx}}{sd_x\sqrt{n-1}}, \qquad s_{yx} = sd_y\sqrt{\frac{n-1}{n-2}\left(1 - r^2\right)},$$
$sd_x$, $sd_y$ standard deviation of the value of features X and Y
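The test statistic for the Pearson coefficient is simple enough to compute directly. The sketch below uses illustrative values (r = 0.8302 and n = 16 are assumptions chosen for the example) and a hypothetical function name; the p-value is not computed here and would be taken from the t-Student distribution with n - 2 degrees of freedom.

```python
import math


def pearson_t(r_p, n):
    """Return (t, df, se) for the significance test of Pearson's r."""
    if n < 3 or abs(r_p) == 1:
        # The statistic is undefined in these cases, as noted in the text.
        raise ValueError("t cannot be calculated when |r| = 1 or n < 3")
    se = math.sqrt((1 - r_p ** 2) / (n - 2))  # standard error of r
    return r_p / se, n - 2, se


t, df, se = pearson_t(0.8302, 16)
print(f"t = {t:.4f}, df = {df}, SE = {se:.6f}")
```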
199. ear combination which defines a component. The sign of those coefficients points to the direction of the influence and is accidental, which does not change the value of the carried information.

Factor loadings
Factor loadings, just like the coefficients included in the eigenvector, reflect the influence of particular variables on a given principal component. Those values illustrate what part of the variance of a given component is constituted by the original variables. When an analysis is based on the correlation matrix, we interpret those values as correlation coefficients between the original variables and a given principal component.

Variable contributions
They are based on the determination coefficients between the original variables and a given principal component. They show what percentage of the variability of a given principal component can be explained by the variability of particular original variables.

Communalities
They are based on the determination coefficients between the original variables and a given principal component. They show what percentage of the variability of a given original variable can be explained by the variability of a few initial principal components. For example, the result concerning the second variable contained in the column concerning the fourth principal component tells us what percent of the variability of the second variable can be exp
200. eature,
$H_1$: there is a dependence between the analysed features of the population.

• Hypotheses in the meaning of homogeneity:
$H_0$: in the analysed population, the distribution of X feature categories is exactly the same for each category of Y feature,
$H_1$: in the analysed population, the distribution of X feature categories is different for at least one category of Y feature.

Compare the p-value calculated on the basis of the test statistic with the significance level $\alpha$:
if p ≤ α ⟹ reject $H_0$ and accept $H_1$,
if p > α ⟹ there is no reason to reject $H_0$.

The Chi-square test for R x C tables
The $\chi^2$ test for r x c tables is also known as the Pearson's Chi-square test (Karl Pearson 1900). This test is an extension to 2 features of the $\chi^2$ test (goodness-of-fit).

The test statistic is defined by
$$\chi^2 = \sum_{i=1}^{r}\sum_{j=1}^{c}\frac{(O_{ij} - E_{ij})^2}{E_{ij}}.$$
This statistic asymptotically (for large expected frequencies) has the $\chi^2$ distribution with a number of degrees of freedom calculated using the formula $df = (r-1)(c-1)$.

Compare the p-value calculated on the basis of the test statistic with the significance level $\alpha$.

The settings window with the Chi-square test (RxC) can be opened in Statistics menu → NonParametric tests (unordered categories) → Chi-square (RxC) or in Wizard.

[Settings window: Chi-square (RxC)]
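The statistic for r x c tables can be computed from the observed counts alone. The Python sketch below assumes the usual expected frequencies, E_ij = (row total x column total) / n, which the excerpt above does not spell out; the function name and the example table are illustrative.

```python
def chi_square_rxc(table):
    """Return (chi2, df) for an r x c table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n  # assumed E_ij
            chi2 += (observed - expected) ** 2 / expected
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return chi2, df


# Illustrative 2 x 3 table of observed frequencies.
observed = [[10, 20, 30],
            [20, 20, 20]]
chi2, df = chi_square_rxc(observed)
print(f"chi2 = {chi2:.4f}, df = {df}")
```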
201. ect wrong wrong wrong wrong wrong correct wrong wrong wrong wrong correct correct correct correct correct correct correct wrong correct correct correct wrong wrong correct wrong wrong wrong correct correct wrong correct wrong wrong correct wrong wrong correct wrong wrong wrong wrong wrong correct wrong correct wrong wrong wrong wrong wrong wrong

Hypotheses:
$H_0$: The individual questions received the same number of correct answers in the analysed population,
$H_1$: There are different numbers of correct and wrong answers in individual test questions in the analysed population.

[Report: analysed variables: answers to the individual tasks; significance level: 0.05]

Comparing the p-value p = 0.007699 with the significance level α = 0.05, we conclude that individual test questions have different difficulty levels. We resume the analysis to perform the POST-HOC test by clicking the button, and in the test option window we select POST-HOC Dunn.

[Report and bar chart: frequencies of correct and wrong answers for the individual tasks]

The carried out POST-HOC analysis indicates that there are differences between the 2 nd and 1 st
202. ed in a Settings window. However, note that the higher the values used in an operation, the more computer memory is used by the program.

How to insert and delete rows and columns
You can insert empty columns or rows above or to the left of already existing ones. This moves the existing ones down or to the right. To insert a row (rows), select the one (ones) above which you want to insert new ones, then choose Insert row in the context menu of the number of the selected row. New columns are inserted in exactly the same way.
Rows and columns can be deleted as well as inserted. You can delete them by selecting Delete row / Delete column in the context menu of the number of a row or a column.

How to find/replace a cell value
To find a cell value or replace cell contents with another value, use the Search/Replace window, which you can find in Edit menu → Find/Replace (Ctrl+F). To search, use the upper half of the window; to change cell content, use the lower half of the window.

[Find/Replace window]

To find specific data, type the characters in the upper half of the window, then select the search order and click Find. To find and to replace the whole cell content with another valu
203. edian of the time spent in front of a computer screen is exactly the same in both the male and the female population of students at the analysed university,
$H_1$: the median of the time spent in front of a computer screen is different in the male population and the female population of students at the analysed university.

[Report: Mann-Whitney test. Analysed variables: time (hours), sex. Reported items: significance level, continuity correction, grouping variable; for each group: group name, group size, mean of the ranks for the group, group sum of ranks, group median; U statistic; p-value (asymptotic). Box-whiskers plot of time (hours) by sex (f, m); legend: median, min-max]

Based on the assumed level α = 0.05 and the Z statistic of the Mann-Whitney test without the continuity correction (p-value = 0.015441) and with the continuity correction (p-value = 0.015821), and also based on the exact U statistic (p-value = 0.014948), you can assume that there are statistically significant differences between male and female students in the time spent in front of a computer. Female students spend less time in front of a computer than male students: the mean of the ranks for women is 22.02 (the median is 5), and it is much lower than the mean of the ranks for men, w
204. ely new variables (principal components) which are a linear combination of the observed (original) variables. An exact analysis of the principal components makes it possible to point to those original variables which have a big influence on the appearance of particular principal components, that is, those variables which constitute a homogeneous group. A principal component is then a representative of that group. Subsequent components are mutually orthogonal (uncorrelated) and their number (k) is lower than or equal to the number of original variables (p).

Particular principal components are a linear combination of original variables:
$$Z_i = a_{i1}X_1 + a_{i2}X_2 + \dots + a_{ip}X_p,$$
where $X_1, X_2, \dots, X_p$ are the original variables and $a_{i1}, a_{i2}, \dots, a_{ip}$ are the coefficients of the $i$th principal component.

Each principal component explains a certain part of the variability of the original variables. They are, then, naturally based on such measures of variability as covariance (if the original variables are of similar size and are expressed in similar units) or correlation (if the assumptions necessary in order to use covariance are not fulfilled).

Mathematical calculations which allow the distinction of principal components include defining the eigenvalues and the corresponding eigenvectors from the following matrix equation:
$$(M - \lambda I)a = 0,$$
where $\lambda$
205. en the option Intervals/ranks, set the start value, for example 130, and the step value, 5. You may also select the Add graph option. Next, confirm all the chosen options by clicking OK, and you get the result presented in a report.

[Report table and bar chart: frequency, cumulative frequency, percent and cumulative percent of the variable "amount of used free minutes" in intervals of width 5, from 130-135 to 185-190]

3. Do the analysis again by clicking the button. Set the filter so that the analysis is carried out only for individual persons. Choose the variable you want to analyse: the number of contracts. This variable includes missing data (empty cases), so they may either be taken into account in the result or left out of it. This depends on the chosen option, which refers to ignoring or not ignoring the empty cases.

[Report. Data filter: kind of contract = individual. Variable: number of contracts. Percent: 79.661, 10.169, 9.322, 0.847; cumulative percent: 79.661, 89.831, 99.153, 100]
206. [Report of the linear correlation analysis for height against age. Reported items: correlation coefficient, t statistic for r, degrees of freedom, p-value; a (slope), std. err. of a, 95% CI for the a coefficient, t-test statistic for a, degrees of freedom, p-value; b (Y-intercept), std. err. of b, 95% CI for the b coefficient, t-test statistic for b, degrees of freedom, p-value; prediction of the Y value for X = 6 and the 95% CI for the prediction of Y. Scatter plot of height against age with the fitted line y = 105.830 + x · 5.090]

Comparing the p-value = 0.000069 with the significance level α = 0.05, we draw the conclusion that there is a linear dependence between age and height in the population of children attending the analysed school. This dependence is directly proportional, which means that the children grow taller as they get older.

The Pearson product-moment correlation coefficient, that is, the strength of the linear relation between age and height, is $r_p$ = 0.8302. The coefficient of determination $r_p^2$ = 0.6892 means that about 69% of the variability of height is explained by the chan
207. er test of internal symmetry
11.2.10 Z Test for two dependent proportions
12 COMPARISON - MORE THAN 2 GROUPS
12.1 PARAMETRIC TESTS
12.1.1 The ANOVA for independent groups
12.1.2 The contrasts and the POST-HOC tests
12.1.3 The Brown-Forsythe test and the Levene test
12.1.4 The ANOVA for dependent groups
12.2 NONPARAMETRIC TESTS
12.2.1 The Kruskal-Wallis ANOVA
12.2.2 The Friedman ANOVA
12.2.3 The Chi-square test for multidimensional contingency tables
12.2.4 The Q-Cochran ANOVA
13 STRATIFIED ANALYSIS
13.1 THE MANTEL-HAENSZEL METHOD FOR SEVERAL 2x2 TABLES
13.1.1 The Mantel-Haenszel odds ratio
13.1.2 The Mantel-Haenszel relative risk
14 CORRELATION
14.1 PARAMETRIC TESTS
14.1.1 THE LINEAR CORRELATION COEFFICIENTS
14.1.2 The test of significance for the Pearson product-moment correlation coefficient
14.1.3 The test of significance for the coefficient of linear regression equation
14.1.4 The test for checking the equality of the Pearson product-moment correlation coefficients which come
208. er the transplantation. Therefore, we conclude that the initial period after the transplantation does not carry a particular risk of death. The value of the median shows that within 10 years after the transplantation half of the patients have died and the other half are still alive. The value is marked on the graph by drawing a line at point 0.5, which signifies the median. In a similar manner we mark the quartiles in the graph.

19.3 COMPARISON OF SURVIVAL CURVES

The survival functions can be built separately for different subgroups, e.g. separately for women and men, and then compared. Such a comparison may concern two curves or more.

The window with settings for the comparison of survival curves is accessed via the menu Statistics → Survival analysis → Comparison groups.

[Settings window "Comparison groups"; options include: for trend, Data Filter (basic / multiple), Add analysed data, More results, Add graph]

Comparisons of k survival curves $S_1, S_2, \dots, S_k$ at particular points of the survival time $t$ can be made in the program with the use of three tests:

Log-rank test: the most popular test, drawing on the Mantel-Haenszel procedure for many 2 x 2 tables (Mantel
209. erification of statistical hypotheses: testing some specific assumptions formulated for the parameters of the general population on the basis of sample results.

9.0.1 POINT AND INTERVAL ESTIMATION

In practice, we usually do not know the parameters (characteristics) of the whole population. There is only a sample chosen from the population. Point estimators are the characteristics obtained from a random sample. The exactness of an estimator is defined by its standard error. The real parameters of the population are in the area of the indicated point estimator. For example, the population parameter arithmetic mean $\mu$ is in the area of the estimator from the sample, which is $\bar{x}$.

If you know the estimators of the sample and their theoretical distributions, you can estimate the values of the population parameters with the confidence level $(1-\alpha)$ defined in advance. This process is called interval estimation, the interval is called a confidence interval, and $\alpha$ is called a significance level. The most popular significance levels are 0.05, 0.01 or 0.001.

9.0.2 VERIFICATION OF STATISTICAL HYPOTHESES

To verify a statistical hypothesis, follow several steps.

The 1st step: Make a hypothesis which can be verified by means of statistical tests. Each statistical test gives you a general form of the null hypothesis $H_0$ and the alternative one $H_1$:
$H_0$: there is no statistically significant difference among populations (means, medians, proportions, distributions, etc.),
$H_1$:
210. erify the hypothesis determining the equality of 2 coefficients of the linear regression equation, $\beta_1$ and $\beta_2$, in the analysed populations.

Basic assumptions:
• $\beta_1$ and $\beta_2$ come from 2 samples which are chosen randomly from independent populations,
• $\beta_1$ and $\beta_2$ describe the strength of dependence of the same features X and Y,
• both sample sizes ($n_1$ and $n_2$) are known,
• the standard deviations for the values of both features in both samples ($sd_{x_1}$, $sd_{y_1}$ and $sd_{x_2}$, $sd_{y_2}$) are known,
• the Pearson product-moment correlation coefficients of both samples ($r_{p_1}$ and $r_{p_2}$) are known.

Hypotheses:
$H_0: \beta_1 = \beta_2$,
$H_1: \beta_1 \neq \beta_2$.

The test statistic is defined by
$$t = \frac{\beta_1 - \beta_2}{\sqrt{\frac{s_{yx_1}^2}{sd_{x_1}^2(n_1-1)} + \frac{s_{yx_2}^2}{sd_{x_2}^2(n_2-1)}}},$$
where
$$s_{yx_1} = sd_{y_1}\sqrt{\frac{n_1-1}{n_1-2}\left(1-r_{p_1}^2\right)}, \qquad s_{yx_2} = sd_{y_2}\sqrt{\frac{n_2-1}{n_2-2}\left(1-r_{p_2}^2\right)}.$$

The test statistic has the t-Student distribution with $n_1+n_2-4$ degrees of freedom. The p-value designated on the basis of the test statistic is compared with the significance level $\alpha$:
if p ≤ α ⟹ reject $H_0$ and accept $H_1$,
if p > α ⟹ there is no reason to reject $H_0$.

The settings window with the comparison of correlation coefficients can be opened in Statistics menu → Parametric tests → comparison of correlation coefficients.

[Settings window "Comparison of correlation coefficients": Pearson correlation coefficient, sample siz
211. es in a contingency table.

• Hypotheses in the meaning of independence:
$H_0$: there is no dependence between the analysed features of the population (both classifications are statistically independent according to X and Y feature),
$H_1$: there is a dependence between the analysed features of the population.

• Hypotheses in the meaning of homogeneity:
$H_0$: in the analysed population, the distribution of X feature categories is exactly the same for both categories of Y feature,
$H_1$: in the analysed population, the distribution of X feature categories is different for both categories of Y feature.

The p-value designated on the basis of the test statistic is compared with the significance level $\alpha$:
if p ≤ α ⟹ reject $H_0$ and accept $H_1$,
if p > α ⟹ there is no reason to reject $H_0$.

Note: Additionally, for 2 x 2 contingency tables PQStat also calculates the odds ratio (OR) and the relative risk (RR), together with their confidence intervals. These intervals are calculated on the basis of the approximate $\chi^2$ distribution if they accompany the $\chi^2$ test, or of the exact algorithms if they accompany the Fisher's test and mid-p.

The Chi-square test for 2 x 2 tables
The $\chi^2$ test for 2 x 2 tables (the Pearson's Chi-square test, Karl Pearson 1900) is a restriction of the $\chi^2$ test for r x c tables.

The test statistic is defined by
$$\chi^2 = \sum_{i=1}^{2}\sum_{j=1}^{2}\frac{(O_{ij} - E_{ij})^2}{E_{ij}}.$$
This statistic asymptotically for
212. es of freedom: 1; p-value

Group        Obs   Exp         Obs/Exp
hospital 1   21    19.257088   1.0905075
hospital 2   32    33.742911   0.9483473

Group        Hazard ratio   -95% CI     +95% CI
hospital 1   1.1499031      0.6569948   2.0126144

[Graph: survival function for hospital 1 and hospital 2; vertical axis: survival probability]

On the basis of the significance level α = 0.05 and the obtained values p = 0.6004 for the log-rank test (p = 0.6959 for Gehan's and 0.6465 for Tarone-Ware), we conclude that there is no basis for rejecting the hypothesis $H_0$. The length of life calculated for the patients of both hospitals is similar.

The same conclusion is reached when comparing the risk of death for those hospitals by determining the risk ratio. The obtained estimated value is HR = 1.1499, and the 95% confidence interval for that value contains 1: (0.6570, 2.0126).

Differences for many survival curves
Liver transplantations were made for people at different ages. 3 age groups were distinguished: 45-50 years, 50-55 years, 55-60 years. We will check if the patients' survival time after transplantation depended on their age at the time of the transplantation.

Hypotheses:
$H_0$: the survival rates of patients aged 45-50 years, 50-55 years and 55-60 years are similar,
$H_1$: at least one survival curve out
213. f the previous and the newly reduced model. In that way numerous, ever smaller models are created. The last model contains only 1 independent variable.

EXAMPLE 19.2 (file: remissionLeukemia.pqs)
The analysis is based on the data about leukemia described in the work of Freireich et al. (1963) [32] and further analyzed by many authors, including Kleinbaum and Klein (2005) [44]. The data contain information about the time (in weeks) of remission until the moment when a patient was withdrawn from the study because of an end of remission (a return of the symptoms) or because the information about the patient was censored. The end of remission is the result of a failure event and is treated as a complete observation. An observation is censored if a patient remains in the study to the end and the end of remission does not occur, or if the patient leaves the study.

Patients were assigned to one of two groups: a group undergoing treatment (marked as 1) and a placebo group (marked as 0). Information was also gathered about the patients' sex (1 = man, 0 = woman) and about the values of the indicator of the number of white cells (marked as log WBC), which is a well-known prognostic factor.

The aim of the study is to determine the influence of treatment on the time of remaining in remission, taking into account possible confounding factors and interactions. In the analysis we will fo
214. f the skewness, kurtosis, std. err. of the kurtosis, minimum, maximum, lower quartile, upper quartile, percentile 10 and percentile 90; Data Filter (basic / multiple); Report options: Add analysed data, Add graph]

In this window you need to select the variables you want to analyse and then select all the descriptive statistics measures you need for the analysis. Note that you can select separate statistics or groups of statistics using the button. Confirm your choice by clicking OK. The result of the analysis is presented in a report added to the datasheet on the basis of which the analysis was done.

Additionally, if we want the data to be illustrated in a Box-Whiskers plot, we select the Add graph option in the Descriptive statistics window.

7.1 MEASUREMENT SCALES

A properly defined kind of analysis depends on the scale on which the data are presented. There are 3 main measurement scales:

1. Interval scale
Variables are assessed on an interval scale if:
• it is possible to order them,
• it is possible to calculate how much one element is greater than the other one, and the differences between these elements are interpretable in the real world.
Usually the unit of measurement is defined.
215. for smaller sizes.

Bayesian information criterion (or Schwarz criterion):
$$BIC = -2\ln L_{FM} + k\ln(d).$$
Just like the corrected Akaike criterion, it takes into account the sample size (the number of failure events), Volinsky and Raftery (2000) [78].

• Pseudo $R^2$, the so-called McFadden $R^2$, is a goodness-of-fit measure of the model (an equivalent of the coefficient of multiple determination $R^2$ defined for multiple linear regression). The value of that coefficient falls within the range of ⟨0; 1), where values close to 1 mean excellent goodness of fit of the model and 0 a complete lack of fit. The coefficient $R^2_{Pseudo}$ is calculated according to the formula
$$R^2_{Pseudo} = 1 - \frac{\ln L_{FM}}{\ln L_0}.$$
As the coefficient $R^2_{Pseudo}$ does not assume the value 1 and is sensitive to the number of variables in the model, its corrected value is calculated:
$$R^2_{Nagelkerke} = \frac{1 - e^{-\frac{2}{d}\left(\ln L_{FM} - \ln L_0\right)}}{1 - e^{\frac{2}{d}\ln L_0}} \quad\text{or}\quad R^2_{Cox-Snell} = 1 - e^{-\frac{2}{d}\left(\ln L_{FM} - \ln L_0\right)}.$$

• Statistical significance of all variables in the model
The basic tool for the evaluation of the significance of all variables in the model is the Likelihood Ratio test. The test verifies the hypothesis:
$H_0$: all $\beta_i = 0$,
$H_1$: there is $\beta_i \neq 0$.
The test statistic has the form presented below:
$$\chi^2 = -2\ln\left(L_0/L_{FM}\right) = -2\ln(L_0) - \left(-2\ln(L_{FM})\right).$$
The statistic asymptotically for large
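The measures listed above all follow from two log-likelihoods, so they can be checked with a few lines of Python. This sketch assumes that `ln_l0` and `ln_lfm` are the natural logarithms of the likelihood of the null model and of the full model, that `k` is the number of variables and `d` the number of failure events; the function name and the numbers are illustrative, not PQStat output.

```python
import math


def fit_measures(ln_l0, ln_lfm, k, d):
    """Return goodness-of-fit measures computed from two log-likelihoods."""
    lr_chi2 = -2 * ln_l0 - (-2 * ln_lfm)   # likelihood ratio statistic
    bic = -2 * ln_lfm + k * math.log(d)    # Bayesian information criterion
    pseudo_r2 = 1 - ln_lfm / ln_l0         # McFadden pseudo R^2
    cox_snell = 1 - math.exp(-(2 / d) * (ln_lfm - ln_l0))
    nagelkerke = cox_snell / (1 - math.exp((2 / d) * ln_l0))
    return {
        "lr_chi2": lr_chi2,
        "bic": bic,
        "pseudo_r2": pseudo_r2,
        "cox_snell": cox_snell,
        "nagelkerke": nagelkerke,
    }


# Illustrative values only.
m = fit_measures(ln_l0=-100.0, ln_lfm=-90.0, k=3, d=30)
for name, value in m.items():
    print(f"{name}: {value:.4f}")
```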
216. g data, Copying with relation, Normalization, Standardization, Similarity matrix

Statistics menu:
Frequency tables
Descriptive statistics
Probability distribution calculator

• Parametric tests
  comparison of one group: t-test
  comparison, dependent groups: t-test for dependent groups, ANOVA for dependent groups
  comparison, independent groups: t-test for independent groups, F Fisher-Snedecor, ANOVA for independent groups, Levene, Brown-Forsythe
  measures of correlation and their comparisons: Linear correlation (r-Pearson), Comparison of correlation coefficients
  measures of agreement: ICC (Intraclass Correlation Coefficient)

• Nonparametric tests (ordered categories)
  comparison of one group: Wilcoxon (signed-ranks), Kolmogorov-Smirnov, Lilliefors
  comparison, dependent groups: Wilcoxon (matched-pairs), Friedman ANOVA
  comparison, independent groups: Mann-Whitney, Chi-square for trend, Kruskal-Wallis ANOVA
  measures of correlation: Monotonic correlation (r-Spearman), Monotonic correlation (tau-Kendall)
  measures of agreement: Kendall's W

• Nonparametric tests (unordered categories)
  comparison of one group: Chi-square, Z for proportion
  comparison, dependent groups: Z for 2 dependent proportions, Bowker-McNemar, Cochran Q ANOVA
  comparison, independent groups: Z for 2 independent proportions, Chi
217. ge sample sizes.

Hypotheses:
$H_0: p_1 = p_2$,
$H_1: p_1 \neq p_2$,
where $p_1$, $p_2$ are the fractions for the first and the second population.

The test statistic is defined by
$$Z = \frac{p_1 - p_2}{\sqrt{p(1-p)\left(\frac{1}{n_1}+\frac{1}{n_2}\right)}}, \quad\text{where } p = \frac{m_1+m_2}{n_1+n_2}.$$

The test statistic modified by the continuity correction is defined by
$$Z = \frac{p_1 - p_2 - \frac{1}{2}\left(\frac{1}{n_1}+\frac{1}{n_2}\right)}{\sqrt{p(1-p)\left(\frac{1}{n_1}+\frac{1}{n_2}\right)}}.$$

The Z statistic, with and without the continuity correction, asymptotically (for large sample sizes) has the normal distribution. The p-value designated on the basis of the test statistic is compared with the significance level $\alpha$:
if p ≤ α ⟹ reject $H_0$ and accept $H_1$,
if p > α ⟹ there is no reason to reject $H_0$.

Apart from the difference between proportions, the program calculates the value of the NNT. NNT (number needed to treat) is an indicator used in medicine to define the number of patients who have to be treated for a certain time in order to cure one person.

Note: From PQStat version 1.3.0, the confidence intervals for the difference between two independent proportions are estimated on the basis of the Newcombe-Wilson method. In the previous versions they were estimated on the basis of the Wald method. The justification of the change is as follows: confidence intervals based on the classical Wald method are suitable for large sample sizes and for differences between proportions far from 0 or 1. For small samples and
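A minimal Python sketch of the Z test for two independent proportions (without the continuity correction) is given below. It assumes that m1 and m2 are the numbers of cases with the analysed feature in samples of sizes n1 and n2, and that NNT is taken as the reciprocal of the absolute difference between the proportions, which is the usual definition but is not stated in the excerpt above. The function name and the counts are illustrative.

```python
import math
from statistics import NormalDist


def z_two_proportions(m1, n1, m2, n2):
    """Return (z, two-sided p-value, NNT) for two independent proportions."""
    p1, p2 = m1 / n1, m2 / n2
    p = (m1 + m2) / (n1 + n2)                        # pooled proportion
    se = math.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))     # normal approximation
    nnt = 1 / abs(p1 - p2)                           # assumed NNT definition
    return z, p_value, nnt


# Illustrative counts: 30 of 100 versus 20 of 100.
z, p_value, nnt = z_two_proportions(30, 100, 20, 100)
print(f"Z = {z:.4f}, p = {p_value:.4f}, NNT = {nnt:.1f}")
```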
218. ging of age. From the regression equation
$$height = 5.09 \cdot age + 105.83$$
it is possible to calculate the predicted value for a child, for example at the age of 6. The predicted height of such a child is 136.37 cm.
14.1.4 The test for checking the equality of the Pearson product-moment correlation coefficients which come from 2 independent populations
This test is used to verify the hypothesis determining the equality of 2 Pearson's linear correlation coefficients ($R_{p_1}$, $R_{p_2}$).
Basic assumptions:
$r_{p_1}$ and $r_{p_2}$ come from 2 samples which are chosen randomly from independent populations,
$r_{p_1}$ and $r_{p_2}$ describe the strength of dependence of the same features X and Y,
sizes of both samples ($n_1$ and $n_2$) are known.
Hypotheses:
$H_0: R_{p_1} = R_{p_2}$,
$H_1: R_{p_1} \neq R_{p_2}$.
The test statistic is defined by:
$$t = \frac{z_{r_{p_1}} - z_{r_{p_2}}}{\sqrt{\frac{1}{n_1-3}+\frac{1}{n_2-3}}},$$
where
$$z_{r_{p_1}} = \frac{1}{2}\ln\left(\frac{1+r_{p_1}}{1-r_{p_1}}\right), \qquad z_{r_{p_2}} = \frac{1}{2}\ln\left(\frac{1+r_{p_2}}{1-r_{p_2}}\right).$$
The test statistic has the t-Student distribution with $n_1+n_2-4$ degrees of freedom. The p value designated on the basis of the test statistic is compared with the significance level $\alpha$:
if $p \le \alpha$ then reject $H_0$ and accept $H_1$,
if $p > \alpha$ then there is no reason to reject $H_0$.
14.1.5 The test for checking the equality of the coefficients of linear regression equation which come from 2 independent populations
This test is used to v
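The prediction from the regression equation and the comparison of two correlation coefficients can be checked with a short sketch. The function names are hypothetical and the sample values in the usage lines are made up for illustration. Only the statistic and its degrees of freedom are computed, because the t distribution needed for the p value is not part of the Python standard library.

```python
import math


def predict_height(age):
    """Predicted height in cm from the regression equation quoted in the text."""
    return 5.09 * age + 105.83


def fisher_z(r):
    """Fisher transformation of a correlation coefficient."""
    return 0.5 * math.log((1 + r) / (1 - r))


def compare_correlations(r1, n1, r2, n2):
    """Test statistic and degrees of freedom for H0: R_p1 = R_p2."""
    t = (fisher_z(r1) - fisher_z(r2)) / math.sqrt(1 / (n1 - 3) + 1 / (n2 - 3))
    df = n1 + n2 - 4
    return t, df


print(round(predict_height(6), 2))
print(compare_correlations(0.70, 40, 0.45, 50))  # illustrative values only
```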
219. $> \alpha$ then there is no reason to reject $H_0$.
The POST-HOC tests
An introduction to the contrasts and the POST-HOC tests is given in unit 12.1.2, which relates to the one-way analysis of variance.
The Dunn test
For simple comparisons, the frequency in particular measurements is always the same.
Hypotheses (example: simple comparisons, i.e. comparison of 2 selected medians):
$H_0: \theta_j = \theta_{j+1}$,
$H_1: \theta_j \neq \theta_{j+1}$.
(i) The value of the critical difference is calculated by using the following formula:
$$CD = Z_{\frac{\alpha}{c}}\sqrt{\frac{k(k+1)}{6n}},$$
where $Z_{\frac{\alpha}{c}}$ is the critical value (statistic) of the normal distribution for a given significance level $\alpha$ corrected on the number of possible simple comparisons $c$.
(ii) The test statistic is defined by:
$$Z = \frac{\overline{R}_j - \overline{R}_{j+1}}{\sqrt{\frac{k(k+1)}{6n}}},$$
where $\overline{R}_j$ is the mean of the ranks of the j-th measurement, for $j = 1, 2, \ldots, k$.
The test statistic asymptotically (for large sample size) has the normal distribution, and the p value is corrected on the number of possible simple comparisons $c$.
The settings window with the Friedman ANOVA can be opened in Statistics menu > NonParametric tests (ordered categories) > Friedman ANOVA or in Wizard.
[Screenshot: Friedman ANOVA settings window with the variables Quarter I to Quarter IV and the test options POST-HOC and Contrasts]
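The critical difference and the corrected p values of the Dunn test after a Friedman ANOVA can be sketched as follows. This is an illustration under stated assumptions, not PQStat's code: the function name is hypothetical, all $c = k(k-1)/2$ pairs are compared, the test is treated as two-sided (so the normal quantile is taken at $1 - \alpha/(2c)$), and the p value correction is a plain multiplication by $c$ capped at 1.

```python
import math
from itertools import combinations
from statistics import NormalDist


def dunn_after_friedman(mean_ranks, n, alpha=0.05):
    """Critical difference and pairwise results for k dependent measurements.

    mean_ranks: mean rank of each of the k measurements
    n: number of objects (rows) ranked
    """
    k = len(mean_ranks)
    c = k * (k - 1) // 2                       # number of simple comparisons
    se = math.sqrt(k * (k + 1) / (6 * n))
    # Assumption: two-sided test, level alpha split over c comparisons.
    z_crit = NormalDist().inv_cdf(1 - alpha / (2 * c))
    critical_difference = z_crit * se
    results = []
    for i, j in combinations(range(k), 2):
        z = (mean_ranks[i] - mean_ranks[j]) / se
        p = math.erfc(abs(z) / math.sqrt(2))   # two-sided p value
        results.append((i, j, z, min(1.0, p * c)))
    return critical_difference, results


cd, pairs = dunn_after_friedman([1.5, 2.0, 2.5, 4.0], n=10)
print(cd)
for i, j, z, p_adj in pairs:
    print(i, j, round(z, 3), p_adj)
```

Under these assumptions the two criteria agree: a pair of mean ranks differs by at least the critical difference exactly when its corrected p value is at most $\alpha$.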
220. h a situation shows that the products compete with each other, i.e. the purchase of one product will exclude the purchase of the other one.
The formula of Jaccard's similarity coefficient can also be presented in the general form
$$J = \frac{\sum_{k} x_{1k}x_{2k}}{\sum_{k} x_{1k}^2 + \sum_{k} x_{2k}^2 - \sum_{k} x_{1k}x_{2k}}$$
proposed by Tanimoto (1957). An important feature of the Tanimoto formula is that it can also be calculated for continuous characteristics.
In the case of binary data, Jaccard's and Tanimoto's dissimilarity (similarity) formulas are identical and fulfill the conditions of a metric. For continuous variables the Tanimoto formula is not a metric (it does not fulfill the triangle inequality).
Example (a comparison of species)
We compare the genetic similarity of the representatives of three different species in terms of the number of genes common to all the species. If a gene is present in an organism we ascribe it the value 1. In the opposite case we ascribe it the value 0. For the sake of simplicity, only 10 genes are subjected to the analysis.
GENES gen1 gen2 gen3 gen9 gen10
representative1 0 1 1
representative2 0 0 1
representative3 1 0 0
The calculated similarity matrix looks as follows:
REPRESENTATIVES representative1 representative2 representative3
representative1 0 0.857143 0.375
representative2 0.857143 0 0.428571
representative3 0.375 0.428571
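The statement that the Jaccard and Tanimoto formulas coincide for binary data can be verified with a small sketch. The function names and the gene vectors below are hypothetical (the full 10-gene table is not recoverable from the text), and returning 1.0 for two all-zero vectors is a convention chosen here, not something the manual states.

```python
def jaccard_similarity(a, b):
    """Shared 1s divided by positions where at least one vector has a 1."""
    both = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
    either = sum(1 for x, y in zip(a, b) if x == 1 or y == 1)
    return both / either if either else 1.0  # convention for two empty sets


def tanimoto_similarity(a, b):
    """General (Tanimoto) form; also defined for continuous values."""
    dot = sum(x * y for x, y in zip(a, b))
    denom = sum(x * x for x in a) + sum(y * y for y in b) - dot
    return dot / denom if denom else 1.0


# Hypothetical presence/absence vectors for 10 genes.
rep1 = [0, 1, 1, 1, 0, 1, 1, 0, 1, 1]
rep2 = [0, 0, 1, 0, 0, 0, 0, 1, 0, 1]

sim = jaccard_similarity(rep1, rep2)
print(sim, tanimoto_similarity(rep1, rep2), 1 - sim)  # last value: dissimilarity
```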
221. h the significance level a 0 05 it may be assumed that students changed their opinions Looking at the table you can see that there were more students who changed their opinions to negative ones after the exam than those who changed it to positive ones after the exam There were also students who did not evaluate the professor in the positive way after the exam any more If you limit your analysis only to the people having clear opinions about the professor positive or neg ative ones you can use the McNemar test Hypotheses Ho the number of students who changed their opinions from negative to positive ones is exactly the same as those who changed their opinions from positive to negative H the number of students who changed their opinions from negative to positive ones is different from those who changed their opinions from positive to negative Analysis time Analysed variables Study 1 Study 2 negative positive Significance level 0 05 Continuity correction Yes Data Filter Study 1 lt gt I have no opinion and Study 2 lt gt I have no opi Size number of pairs lloraz szans 0 090909 95 CI for the Odds Ratio 0 052665 95 CI for the Odds Ratio 0 255007 Chi square statistic Degrees of freedom 1 p value 0 000001 Copyright 2010 2014 PQStat Software All rights reserved 140 11 COMPARISON 2 GROUPS oO negative lt gt positive 10 below main diagonal above main diagonal If you co
222. he above groups by the test may differ from the number of people genuinely ill and genuinely healthy.
There are two evaluation measurements of the test accuracy. They are:
Sensitivity: describes the ability to detect people genuinely ill (having a particular feature). If we examine a group of ill people, the sensitivity tells us what percentage of them have a positive test result.
$$sensitivity = \frac{TP}{TP+FN}$$
The confidence interval is built on the basis of the Clopper-Pearson method for a single proportion.
Specificity: describes the ability to detect people genuinely healthy (without a particular feature). If we examine a group of genuinely healthy people, the specificity tells us what percentage of them have a negative test result.
$$specificity = \frac{TN}{FP+TN}$$
The confidence interval is built on the basis of the Clopper-Pearson method for a single proportion.
Positive predictive values, negative predictive values and prevalence rate
Positive predictive value (PPV): the probability that a person having a positive test result suffers from a disease. If the examined person obtains a positive test result, the PPV tells them how sure they can be that they suffer from a particular disease.
$$PPV = \frac{TP}{TP+FP}$$
Confidence interval is built on the basis of the C
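The three measures defined above follow directly from the four cells of the diagnostic table. A minimal sketch with a hypothetical function name and made-up counts (the Clopper-Pearson intervals are not computed here):

```python
def diagnostic_measures(tp, fp, fn, tn):
    """Sensitivity, specificity and PPV from the counts of a 2x2 table.

    tp: ill, positive test      fn: ill, negative test
    fp: healthy, positive test  tn: healthy, negative test
    """
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (fp + tn),
        "ppv": tp / (tp + fp),
    }


# Illustrative counts only.
for name, value in diagnostic_measures(tp=90, fp=20, fn=10, tn=80).items():
    print(f"{name}: {value:.4f}")
```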
223. he algorithm should reach convergence, and the convergence criterion: it gives the value below which the received improvement of estimation shall be considered to be insignificant and the algorithm will stop.
19.4.1 Hazard ratio
An individual hazard ratio (HR) is now calculated for each independent variable:
$$HR_i = e^{\beta_i}.$$
It expresses the change of the risk of a failure event when the independent variable grows by 1 unit. The result is adjusted to the remaining independent variables in the model: it is assumed that they remain stable while the studied independent variable grows by 1 unit.
The HR value is interpreted as follows:
$HR > 1$ means the stimulating influence of the studied independent variable on the occurrence of the failure event, i.e. it gives information about how much greater the risk of the occurrence of the failure event is when the independent variable grows by 1 unit.
$HR < 1$ means the destimulating influence of the studied independent variable on the occurrence of the failure event, i.e. it gives information about how much lower the risk of the occurrence of the failure event is when the independent variable grows by 1 unit.
$HR \approx 1$ means that the studied independent variable has no influence on the occurrence of the failure event.
Note: If the analysis is made for a model other than linear, or if interact
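The relation between a regression coefficient and its hazard ratio is a one-line computation. In the sketch below the function names, the tolerance used for "approximately 1" and the coefficient values are all assumptions made for illustration.

```python
import math


def hazard_ratio(beta):
    """HR for a 1-unit increase of a variable with coefficient beta."""
    return math.exp(beta)


def describe_influence(hr, tolerance=0.01):
    """Verbal reading of HR; the tolerance is an arbitrary illustrative choice."""
    if abs(hr - 1) <= tolerance:
        return "no influence"
    return "stimulating" if hr > 1 else "destimulating"


for beta in (0.405, 0.0, -0.693):  # hypothetical coefficients
    hr = hazard_ratio(beta)
    print(beta, round(hr, 3), describe_influence(hr))
```

A coefficient of 0 gives HR = 1, positive coefficients give HR above 1 and negative coefficients give HR below 1.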
224. he assumption which also pertains to most parametric survival models, i.e. hazard proportionality.
The function on which the Cox proportional hazard model is based describes the resulting hazard and is the product of two values, only one of which depends on time ($t$):
$$h(t, X_1, X_2, \ldots, X_k) = h_0(t)\cdot\exp\left(\sum_{i=1}^{k}\beta_i X_i\right),$$
where:
$h(t, X_1, X_2, \ldots, X_k)$ is the resulting hazard describing the risk changing in time and dependent on other factors, e.g. the treatment method,
$h_0(t)$ is the baseline hazard, i.e. the hazard with the assumption that all the explanatory variables are equal to zero,
$\sum_{i=1}^{k}\beta_i X_i$ is a combination (usually linear) of independent variables and model parameters,
$X_1, X_2, \ldots, X_k$ are explanatory variables independent of time,
$\beta_1, \beta_2, \ldots, \beta_k$ are parameters.
Dummy variables and interactions in the model
A discussion of the coding of dummy variables and interactions is presented in chapter 17.1 Preparation of the variables for the analysis in multidimensional models.
Correction for ties in Cox regression is based on Breslow's method [14].
The model can be transformed into the linear form:
$$\ln\left(\frac{h(t, X_1, X_2, \ldots, X_k)}{h_0(t)}\right) = \beta_1X_1 + \beta_2X_2 + \ldots + \beta_kX_k.$$
In such a case, the solution of the equation is the vector of the estimates of parameters $\beta_1, \beta_2, \ldots, \beta_k$, called regression coefficients. The coefficients are estimated by the so-called partial maximum likelihood estim
225. he form presented below:
$$F = \frac{R_{FM}^2 - R_{RM}^2}{k_{FM}-k_{RM}}\cdot\frac{n-k_{FM}-1}{1-R_{FM}^2}.$$
The statistic is subject to the F Snedecor distribution with $df_1 = k_{FM}-k_{RM}$ and $df_2 = n-k_{FM}-1$ degrees of freedom. The p value designated on the basis of the test statistic is compared with the significance level $\alpha$:
if $p \le \alpha$ then reject $H_0$ and accept $H_1$,
if $p > \alpha$ then there is no reason to reject $H_0$.
If the compared models do not differ significantly, we should select the one with the smaller number of variables, because the lack of a difference means that the variables present in the full model but absent from the reduced model do not carry significant information. However, if the difference in the quality of model adequacy is statistically significant, it means that one of them (the one with the greater number of variables, with a greater $R^2$) is significantly better than the other one.
In the program PQStat the comparison of models can be done manually or automatically.
Manual model comparison: construction of 2 models:
a full model: a model with a greater number of variables,
a reduced model: a model with a smaller number of variables; such a model is created from the full model by removing those variables which are superfluous from the perspective of studying a given phenomenon.
The choice of independent variables in the compared models and, subsequently, the choice of a better model on the basis of the results of the comparison, is made by the researcher.
Au
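The comparison of a full and a reduced model can be sketched from the degrees of freedom given above. The function name and the input values are hypothetical, and the F statistic is written in the standard form for nested linear models, which is assumed to match the formula in the manual. The p value is left out because the F distribution is not available in the Python standard library.

```python
def compare_nested_models(r2_full, k_full, r2_reduced, k_reduced, n):
    """F statistic with its degrees of freedom for full vs reduced model.

    r2_*: coefficients of determination, k_*: numbers of independent
    variables, n: sample size.
    """
    df1 = k_full - k_reduced
    df2 = n - k_full - 1
    f_stat = ((r2_full - r2_reduced) / df1) / ((1 - r2_full) / df2)
    return f_stat, df1, df2


# Illustrative values: 5 variables in the full model, 3 in the reduced one.
print(compare_nested_models(0.60, 5, 0.50, 3, n=106))
```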
226. he hospital in which the transplanta tion took place For that purpose we will choose a hospital as the stratum variable Copyright 2010 2014 PQStat Software All rights reserved 288 Analysis time Analysed variables Significance level Grouping variable Strata variable Frequency Failure events Censored Test LogRank Strata hospital 1 Chi square statistic Degrees of freedom p value For trend Chi square statistic Degrees of freedom p value Strata hospital 2 Chi square statistic Degrees of freedom p value For trend Chi square statistic Degrees of freedom p value Common for stratas Chi square statistic Degrees of freedom p value For trend Chi square statistic Degrees of freedom p value Common strata hospital 1 hospital 1 hospital 1 hospital 2 hospital 2 hospital 2 Common Exp 1 3 1484541 20 3 3935138 14 458031 12 506498 17 490412 2 0030892 15 654952 20 883926 Obs Exp 0 3176161 0 8840394 1 1758170 0 7995843 0 9719610 2 4961444 0 7026530 0 9576743 19 SURVIVAL ANALYSIS time status 0 05 age ordinal 1 2 3 2 413735014 2 0 299132645 1 0 124714733 2 0 062578806 3 0286300936 1 0 0616235659 6 209957261 2 0 043731849 5 374392333 1 0 020434458 ee ee wm Common 16 461121 1 3364824 Copyright 2010 2014 PQStat Software All rights reserved 289 Survival function strata hospital 1 0 8
227. he model as dummy variables In such a case before the commencement of the analysis one should divide that variable into a few dummy variables with 2 categories EXAMPLE 17 2 c d anomaly pqs Let us once more construct a logistic regression model however this time let us divide the variable mother s education into dummy variables With this operation we lose the information about the or dering of the category of education but we gain the possibility of a more in depth analysis of particular categories The division into dummy variables was made by creating 3 variables concerning mother s education VocationalE 1 yes 0 no SecondaryE 1 yes 0 no TertiaryE 1 yes 0 no The primary education variable is missing as it will constitute the reference category intercept BirthWeight MAge FregNo SponAbort ResptTint Smoking VocationalE SecondaryE TertiaryE 1 665115 0 046576 0 438115 0 295937 0 034094 0 299655 0 491751 1 487811 1 457936 0 582289 0 871537 0 79061 0 693346 0 172997 0 17101 0 13156 0 018834 0 101019 0 306896 0 27766 0 414507 0 344821 0 332673 0 339937 95 CI 0 306183 0 385643 0 102942 0 55379 0 071008 0 101662 1 093256 0 943608 0 64552 1 358125 1 523565 1 496274 95 CI 3 024048 0 292491 0 773288 0 038084 0 00282 0 497648 0 109755 2 032014 2 270356 0 006452 0 21951 0 085347 Wald stat 5 767521 0 072464 6 565464 5 059981 3 277034 8 7991
228. he shorter the time needed for the completion of the task, and if there is no disturbing agent, the probability of correct solution is greater:
AGE: OR [95% CI] = 0.90 [0.85, 0.96],
TIME: OR [95% CI] = 0.91 [0.87, 0.97],
DISTURBANCES: OR [95% CI] = 0.15 [0.06, 0.37].
The obtained results of the Odds Ratio are presented on the chart below:
[Chart: odds ratios with 95% confidence intervals for the variables in the model, including TIME, EDUCATION, AGE and ADDRESSOFRES]
Should the model be used for prediction, one should pay attention to the quality of classification. For that purpose we calculate the ROC curve.
ROC curves (DeLong's method)
AUC: 0.834599
SE(AUC): 0.035432
-95% CI: 0.765153
+95% CI: 0.904045
Z statistic: 6.469391
p value: 0.000001
Cut-off point: 0.597865
[Figure: ROC curve, sensitivity against 1 - specificity, with the cut-off point marked]
The result seems satisfactory. The area under the curve is AUC = 0.83 and is statistically greater than 0.5 (p < 0.000001), so classificat
229. he so called rule 5 are satisfied min Ol ols ol ol T Da p gt 5 for all the stratas s 1 2 e max 0 O O3 gt 5 for all the stratas s 1 2 w Hypotheses Ho ORmH 1 H ORmyp 1 The test statistic is defined by Dr DA 2 XMH Pp Copyright 2010 2014 PQStat Software All rights reserved 168 13 STRATIFIED ANALYSIS sl are the expected frequencies in the first con where ce _ CR 0 On 019 o a n s tingency table cell for the individual stratas s 1 2 w wW v 5 vo s 1 o _ Ov O12 08 02 OL 03 Ol O22 p O nl 1 This statistic asymptotically for large frequencies has the y distribution with 1 degree of free dom The p value designated on the basis of the test statistic is compared with the significance level Q fp lt a reject Ho and accept H fp gt a there is no reason to reject Ho The y test of homogeneity for the OR The Chi square test of homogeneity for the OR is used in the hypothesis verification that the variable creating stratas is the modifying effect i e it influences on the designated odds ratio in the manner that the odds ratios are significant different for individual stratas Hypotheses Ho ORmy OR for all the stratas s 1 2 w H ORMH OR for at least one strata The test statistic Breslow Day 1980 12 Tarone 1985 13 77 is defined by 2 2 ee OP BO Ceo Tk
230. he value of 0.5. The user can change the value into any value from the range (0, 1), e.g. the value suggested by the ROC curve.
EXAMPLE 17.2 (anomaly.pqs file)
Studies have been conducted for the purpose of identifying the risk factors for a certain rare congenital anomaly in children. 395 mothers of children with that anomaly and 375 mothers of healthy children participated in that study. The gathered data are: address of residence, child's sex, child's weight at birth, mother's age, number of pregnancy, previous spontaneous abortions, respiratory tract infections, smoking, mother's education.
We construct a logistic regression model to check which variables may have a significant influence on the occurrence of the anomaly. The dependent variable is the column GROUP; the distinguished values in that variable (coded as 1) are the mothers of children with the anomaly. The following 9 variables are independent variables:
AddressOfRes (2 = city, 1 = village),
Sex (1 = male, 0 = female),
BirthWeight (in kilograms, with an accuracy of 0.5 kg),
MAge (in years),
PregNo (which pregnancy the child is from),
SponAbort (1 = yes, 0 = no),
RespTInf (1 = yes, 0 = no),
Smoking (1 = yes, 0 = no),
MeEdu (1 = primary or lower, 2 = vocational, 3 = secondary, 4 = tertiary).
[Report fragment: Analysis time 0.65 sec; Analysed variables AddressOfRes, Sex, BirthWeic; Count of missing data; Significance
231. he vari ables does not have a significant effect on the profit and can be superfluous For the model to be well formulated the interval independent variables ought to be strongly correlated with the dependent variable and be relatively weakly correlated with one another That can be checked by computing the correlation matrix and the covariance matrix prod c advert_c prom_c popular_at gross_profit prod_c advert_c popular_author gross_profit prod_c advert_c prom_c rebates popular_author The most coherent information which allows finding those variables in the model which are superfluous 1 0 770685 0 794845 0 071095 0 131924 0 553605 0 365914 2 392702 5 457308 0 770685 0 079792 0 092951 0 340624 prod_c 0 794845 0 556643 1 0 119669 0 056708 0 326678 advert_c 382 53540 48 679353 70 190897 48 679353 10 429429 70 190897 6 14659 6 14659 20 385641 0 071517 0 278372 0 554231 0 150231 0 237436 0 74359 0 071095 0 079792 0 119669 1 0 056478 0 049327 prom_c 0 385914 0 071517 0 150231 0 077025 0 014536 0 006897 0 151924 0 092951 0 056708 0 056478 1 0 010427 2 392752 0 278572 0 237436 0 014536 0 859974 0 004872 0 553803 0 340624 0 526676 0 049327 0 010427 is given by the parial and semipartial correlation analysis as well as redundancy analysis prod_c advert_c prom_c rebates 0 656793 0 690424 0 164797 0 171176 semipartia 0 3360
232. hich is 32 22 median is 6
11.2.2 The Wilcoxon test (matched pairs)
The Wilcoxon matched-pairs test is also called the Wilcoxon test for dependent groups (Wilcoxon 1945, 1949). It is used if the analysed variable is measured twice, each time in different conditions. It is the extension, to two dependent samples, of the Wilcoxon signed-ranks test designed for one sample. We want to check how big the difference is between the pairs of measurements ($d_i = x_{1i} - x_{2i}$) for each of the analysed objects. This difference is used to verify the hypothesis that the median of the differences in the analysed population equals 0.
Basic assumptions:
measurement on an ordinal scale or on an interval scale,
a dependent model.
Hypotheses:
$H_0: \theta_0 = 0$,
$H_1: \theta_0 \neq 0$,
where $\theta_0$ is the median of the differences $d_i$ in a population.
The p value designated on the basis of the test statistic is compared with the significance level $\alpha$:
if $p \le \alpha$ then reject $H_0$ and accept $H_1$,
if $p > \alpha$ then there is no reason to reject $H_0$.
Note: Depending on the sample size, the test statistic is calculated by using different formulas.
For a small sample size:
$$T = \min\left(\sum R_-, \sum R_+\right),$$
where $\sum R_+$ and $\sum R_-$ are, respectively, the sums of positive and negative ranks.
This statistic has the Wilcoxon distribution and does not contain any
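The small-sample statistic T can be computed by ranking the absolute differences and summing the ranks separately for positive and negative differences. In the sketch below the function names and the data are hypothetical; dropping zero differences and giving tied values their average rank are common conventions assumed here, not taken from the text.

```python
def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1               # mean of positions i+1 .. j+1
        for idx in order[i:j + 1]:
            ranks[idx] = shared
        i = j + 1
    return ranks


def wilcoxon_t(x1, x2):
    """T = min(sum of positive ranks, sum of negative ranks)."""
    d = [a - b for a, b in zip(x1, x2) if a != b]  # assumption: drop zeros
    ranks = average_ranks([abs(v) for v in d])
    r_plus = sum(r for r, v in zip(ranks, d) if v > 0)
    r_minus = sum(r for r, v in zip(ranks, d) if v < 0)
    return min(r_plus, r_minus), r_plus, r_minus


first = [10, 12, 9, 15, 11]    # hypothetical first measurement
second = [8, 13, 9, 11, 14]    # hypothetical second measurement
print(wilcoxon_t(first, second))
```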
233. high resolution.
5.1 GRAPHS GALLERY
Depending on the type of analysis, a different choice of graphs is available.
5.1.1 Bar plots
[Figures: example bar plots. Recoverable labels: "Cumulative percent"; "Frequency"; "Temperature distribution" (axes: Temperature, Number of days); "Are you in favor of the death" (categories: I have no opinion, rather no, rather yes); "Effect of coffee on"]
234. hout taking into consideration the effect of the remaining variables in the model.
Covariance matrix: similarly to the correlation matrix, it contains information about the linear relation among particular variables. That value is not standardized.
Partial correlation coefficient: falls within the range <-1, 1> and is the measure of correlation between the specific independent variable $X_i$ (taking into account its correlation with the remaining variables in the model) and the dependent variable $Y$ (taking into account its correlation with the remaining variables in the model). The square of that coefficient is the partial determination coefficient: it falls within the range <0, 1> and relates only that part of the variance of the given independent variable $X_i$ to that part of the variance of the dependent variable $Y$ which was not explained by other variables in the model. The closer the value of those coefficients is to 0, the less useful the information carried by the studied variable, which means the variable is superfluous.
Semipartial correlation coefficient: falls within the range <-1, 1> and is the measure of correlation between the specific independent variable $X_i$ (taking into account its correlation with the remaining variables in the model) and the dependent variable $Y$ (NOT taking into account its correlation with the remaining variables in the model). The square of that coefficient is the semipartial dete
235. iable Group name Group size Group mean Group standard deviation Group name Group size Group mean Group standard deviation Difference of the means 95 CI for the difference 95 CI for the difference Std err of the difference Pooled standard deviation t statistic Degrees of freedom two sided p value Fisher Snedecor test Variance ratio F p value age company 0 05 No company transport comp transport company 1 50 30 26 5 23259 transport company 2 50 32 68 6 358154 2 42 4 730965 0 109035 1 164527 5 822634 0 677205 0 176168 Copyright 2010 2014 PQStat Software All rights reserved 106 11 COMPARISON 2 GROUPS 40 Mean EE 953 CI Stand dev 24 transport company 1 transport company 2 If you compare the p value 0 040314 with the significance level a 0 05 you draw the conclusion that the average age of all the workers chosen from both companies is different The first company workers are a little bit more than 2 years younger than the second company workers 11 1 4 The t test for dependent groups The t test for dependent groups is used when the measurement of an analysed variable you do twice each time in different conditions but you should assume that variances of the variable in both mea surements are pretty close to each other We want to check how big is the difference between the pairs of measurements d x1 X9 This differenc
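The figures in the report above can be cross-checked from the group summaries alone (50 workers per company, means 30.26 and 32.68, standard deviations 5.23259 and 6.358154). The sketch below uses a hypothetical function name and the standard pooled-variance formulas for the t test for independent groups. It stops at the statistic and the degrees of freedom, since the t distribution is not part of the Python standard library.

```python
import math


def pooled_t_from_summary(n1, mean1, sd1, n2, mean2, sd2):
    """Pooled-variance t test computed from group sizes, means and SDs."""
    df = n1 + n2 - 2
    pooled_sd = math.sqrt(((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / df)
    std_err = pooled_sd * math.sqrt(1 / n1 + 1 / n2)
    t_stat = (mean1 - mean2) / std_err
    return {"difference": mean1 - mean2, "pooled_sd": pooled_sd,
            "std_err": std_err, "t": t_stat, "df": df}


result = pooled_t_from_summary(50, 30.26, 5.23259, 50, 32.68, 6.358154)
for name, value in result.items():
    print(name, value)
```

The pooled standard deviation and the standard error of the difference obtained this way agree with the values 5.822634 and 1.164527 listed in the report.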
236. iables are strongly correlated. The correlation of those variables with the components which form the system is negative: the vectors are in the third quadrant of the coordinate system. The observed values of the coordinates of the vector are higher for the first component than for the second one. Such a placement of vectors indicates that they comprise a uniform group which is represented mainly by the first component.
The vector of the width of the sepal points in an entirely different direction. It is only slightly correlated with the remaining original values, which is shown by the inclination angle with respect to the remaining original values: it is nearly a right angle. The correlation of that vector with the first component is positive and not very high (the low value of the first coordinate of the terminal point of the vector), and it is negative and high (the high value of the second coordinate of the terminal point of the vector) in the case of the second component. From that we may infer that the width of the sepal is the only original variable which is well represented by the second component.
[Figure: Biplot; axes: Factor 1 (72.96%), Factor 2 (22.85%)]
The biplot presents two series of data spread over the first two components. One series are the vectors of original values, which have been presented on the previous graph, and the other series are the points which carry the information about particular flowers. The values of the second
237. ic, $n$ is the total frequency in a contingency table.
The $\phi$ coefficient value is included in a range of <0, 1>. The closer to 0 the value of $\phi$ is, the weaker the dependence joining the analysed features, and the closer to 1, the stronger the dependence joining the analysed features.
The $\phi$ contingency coefficient is considered as statistically significant if the p value calculated on the basis of the $\chi^2$ test (designated for this table) is equal to or less than the significance level $\alpha$.
The settings window with the measures of correlation Q-Yule, Phi can be opened in Statistics menu > NonParametric tests (unordered categories) > Q-Yule, Phi (2x2) or in Wizard.
[Screenshot: Q-Yule, Phi (2x2) settings window with the report options Add analysed data, Add graph, Add percentages]
The Cramer's V contingency coefficient
The Cramer's V contingency coefficient (Cramer, 1946 [24]) is an extension of the $\phi$ coefficient on $r \times c$ contingency tables:
$$V = \sqrt{\frac{\chi^2}{n(w-1)}},$$
where:
$\chi^2$ is the value of the $\chi^2$ test statistic,
$n$ is the total frequency in a contingency table,
$w$ is the smaller value out of $r$ and $c$.
The V coefficient value is included in a range of <0, 1>. The closer to 0 the value of V is, the weaker the dependence joining the analysed features, and the closer to 1, the stronger the dependence joining the analyse
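Cramer's V can be computed directly from a table of observed frequencies. The sketch below uses hypothetical function names and made-up tables; the chi-square statistic is the usual Pearson statistic without continuity correction, which is an assumption about what the manual means by the test statistic.

```python
import math


def chi_square_statistic(table):
    """Pearson chi-square statistic of an r x c table of observed counts."""
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    n = sum(row_sums)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_sums[i] * col_sums[j] / n
            chi2 += (observed - expected) ** 2 / expected
    return chi2, n


def cramers_v(table):
    """V = sqrt(chi2 / (n * (w - 1))), w being the smaller of r and c."""
    chi2, n = chi_square_statistic(table)
    w = min(len(table), len(table[0]))
    return math.sqrt(chi2 / (n * (w - 1)))


print(cramers_v([[10, 20], [20, 10]]))            # weak to moderate dependence
print(cramers_v([[10, 0], [0, 10]]))              # complete dependence
print(cramers_v([[12, 8, 5], [6, 9, 15]]))        # 2 x 3 table
```

For a 2x2 table $w - 1 = 1$, so V reduces to $\sqrt{\chi^2/n}$.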
238. icant diversity.
Standard errors are not measures of the dispersion of measurements. They measure how accurately the value of a population parameter can be determined when only the sample estimators are available.
Standard error of the mean is defined by:
$$SEM = \frac{sd}{\sqrt{n}}.$$
Note: On the basis of a sample estimator you can calculate a confidence interval for a population parameter.
7.4 OTHER DISTRIBUTION CHARACTERISTICS
Skewness (or, in other words, the asymmetry coefficient)
This measure tells us how the data distribution differs from a symmetrical distribution. The closer the value of skewness is to zero, the more symmetrically around the mean the data are spread. Usually the value of this coefficient is included in the range [-1, 1], but in the case of a very big asymmetry it may fall outside that range. A positive skew value indicates that the right skew occurs (the tail on the right side is longer), whereas a negative skew indicates that the left skew occurs (the tail on the left side is longer).
Skewness is defined by:
$$A = \frac{n}{(n-1)(n-2)}\sum_{i=1}^{n}\left(\frac{x_i-\overline{x}}{sd}\right)^3,$$
where:
$x_i$ are the following values of a variable,
$\overline{x}$, $sd$ are, respectively, the arithmetic mean and the standard deviation,
$n$ is the sample size.
[Figure: frequency curves illustrating left skew (A < 0) and right skew (A > 0), with the positions of the mean, median and mode marked]
Kurtosis or coefficient of concentration
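Both characteristics can be sketched with the standard library. The function names and the data are hypothetical. The skewness below is the sample-adjusted form with the factor $n/((n-1)(n-2))$; treating it as the formula the manual intends is an assumption, since the original equation is not legible.

```python
import math
import statistics


def standard_error_of_mean(data):
    """SEM = sd / sqrt(n), with sd the sample standard deviation."""
    return statistics.stdev(data) / math.sqrt(len(data))


def skewness(data):
    """Sample-adjusted skewness (assumed form, see the note above)."""
    n = len(data)
    mean = statistics.fmean(data)
    sd = statistics.stdev(data)
    total = sum(((x - mean) / sd) ** 3 for x in data)
    return n / ((n - 1) * (n - 2)) * total


symmetric = [1, 2, 3, 4, 5]
right_tailed = [1, 1, 2, 2, 3, 12]

print(standard_error_of_mean(symmetric))
print(skewness(symmetric), skewness(right_tailed))
```

The symmetric sample gives a skewness of zero, while the sample with one large value gives a positive (right) skew.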
239. ich divides two neighbouring columns to narrow or widen the column on the left side of the above-mentioned line.
Additionally, you can set a different background colour for each cell of a sheet after selecting the area you want to change. To do this, use:
the button on the toolbar,
the Cell colour command on the cell's context menu.
3.1.5 DATA EDITING
You can select a contiguous area of a sheet using a mouse or a keyboard (arrow keys + Shift). While you are selecting an area, its current size (the number of rows and columns) is displayed in the Message box. You can easily select the whole sheet by clicking the top left corner of the sheet or by selecting Edit > Select all (Ctrl+A) from the menu. If you want to select whole columns or rows, just click their headers.
Copying and moving cells is performed with Copy, Cut and Paste. These commands can be found in several places:
Edit menu,
Context menu of each cell or cells,
buttons on the toolbar,
Context menu of the columns and rows,
Shortcut keys: Copy (Ctrl+C), Cut (Ctrl+X) and Paste (Ctrl+V).
To delete data from cells, select Edit > Delete (Del).
If you want to undo recent operations, select Edit > Undo (Ctrl+Z). The 10 most recent operations are automatically saved in the program memory. Each operation refers to a maximum of 5000 cells. These settings may be chang
240. iduals on time.
An even distribution of points with respect to the value 0 shows the lack of dependence of the residuals on time, i.e. the fulfillment of the assumption of hazard proportionality by a given variable in the model.
If the assumption of hazard proportionality is not fulfilled for any of the variables in the Cox model, one possible solution is to make Cox's analyses separately for each level of that variable.
19.5 COMPARISON OF COX PH REGRESSION MODELS
The window with settings for model comparison is accessed via the menu Statistics > Survival analysis > Cox PH Regression - comparing models.
[Screenshot: Cox PH Regression - comparing models settings window, with the variable lists for the full model, the data filter, and the options Add analysed data and Add graph]
Due to the possibility of simultaneous analysis of many independent variables in one Cox regression model, there is a problem of selection of an optimum model. When choosing independent variables one has to remember to put into the model variables strongl
241. ient's survival time. Analogously, the beginning of the study does not have to be the same point in time for all patients.
19.1 LIFE TABLES
The window with settings for life tables is accessed via the menu Statistics > Survival analysis > Life tables.
[Screenshot: Life tables settings window with the test options (step, correction of the lack of failure events), the data filter and the option Add graph]
Life tables are created for time ranges with equal spans, provided by the researcher. The ranges can be defined by giving the step. For each range PQStat calculates:
the number of entered cases: the number of people who survived until the time defined by the range,
the number of censored cases: the number of people in a given range qualified as censored cases,
the number of cases at risk: the number of people in a given range minus a half of the censored cases in the given range,
the number of complete cases: the number of people who experienced the event (i.e. died) in a given range,
proportion of complete cases: the proportion of the number of complete cases (deaths) in a given range to the number of the cases at risk in that range,
proportion of the survival cases: calculated as 1 minus the proportion of com
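The quantities listed above can be computed range by range. In this minimal sketch the function name, the dictionary keys and the counts are hypothetical. It also assumes that the cases entering the next range are those that were neither censored nor complete in the current one, which the text implies but does not spell out.

```python
def life_table(initial_cases, ranges):
    """Life-table rows for consecutive time ranges of equal span.

    ranges: list of (censored, complete) counts, one pair per range.
    """
    rows = []
    entered = initial_cases
    for censored, complete in ranges:
        at_risk = entered - censored / 2        # half of the censored cases
        prop_complete = complete / at_risk if at_risk > 0 else 0.0
        rows.append({
            "entered": entered,
            "censored": censored,
            "at_risk": at_risk,
            "complete": complete,
            "prop_complete": prop_complete,
            "prop_surviving": 1 - prop_complete,
        })
        entered -= censored + complete          # assumption, see lead-in
    return rows


for row in life_table(100, [(10, 19), (5, 7), (4, 6)]):
    print(row)
```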
242.

4.2 HOW TO REDUCE A DATASHEET WORKSPACE

Usually the whole datasheet workspace is available to you while performing a statistical analysis. However, you can easily limit this area by selecting just the part of the sheet you want to analyse. There are four possible ways to do this:

1. Through activation/deactivation
Activation/deactivation of cases is a global option, superior to the other reductions of the area available in the program. Cases (rows) indicated as deactivated are shaded in the datasheet and are not taken into account in statistical analyses. In order to activate or deactivate selected cases, choose one of the following options:
• select the rows in the datasheet and choose the option Activate/Deactivate from the context menu on their names;
• select the menu Edit → Activate/Deactivate (filter).

EXAMPLE 4.3 (file filtr.pqs)
We are going to conduct many statistical analyses on the data from the file filtr.pqs. The analysis will concern boys aged 16 or over. For that purpose we define the rows which will not be analysed: we select the button and
243. ilter so that the analysis will be carried out separately for individual subsets of data. Results of the analyses will be returned in the following reports.

4.4 INFORMATION GIVEN IN A REPORT

Apart from the basic settings, which refer to the statistical analysis already set up in the test window, it is possible to:
• Add analysed data to a report
Depending on the test, the analysed data are added to the report as raw data or as a contingency table. Additionally, it is possible to view a contingency table of percentages calculated from the row totals, the column totals, or the total sum of the table.
• Add a graph to a report
To add an appropriate graph to the report, select the option Add graph in the window of a particular statistical analysis.
• Limit the number of returned results
If the report of a statistical test includes a lot of results, you can limit the amount of returned information by deselecting the option Full calculations.

4.5 MARKING OF STATISTICALLY SIGNIFICANT RESULTS

In the report, the p-value of a performed statistical test is marked in red only if it is less than the significance level defined by the user. The default significance level for all tests is 0.05. You can change this setting permanently in the Settings window or just temporarily till the
244. ime for the completion of the task was limited to 45 minutes. In the case of participants who completed the task before the deadline, the actual time devoted to the completion of the task was recorded. Variable SOLUTION (yes/no) contains the result of the experiment, i.e. the information about whether the task was solved correctly or not. The remaining variables which could have influenced the result of the experiment are:
ADDRESSOFRES (1 = city, 0 = village),
SEX (1 = female, 0 = male),
AGE (in years),
EDUCATION (1 = primary, 2 = vocational, 3 = secondary, 4 = tertiary),
TIME needed for the completion of the task (in minutes),
DISTURBANCES (1 = yes, 0 = no).

On the basis of all those variables a logistic regression model was built, in which the distinguished state of the variable SOLUTION was set to yes.

Analysed variables: ADDRESSOFRES, SEX, AGE, E...
Significance level: 0.05
Likelihood ratio test
  Log Likelihood: -64.354117
  -2 Log Likelihood: 128.708234
  Log Likelihood (intercept): -87.88099
  -2 Log Likelihood (intercept): 175.761979
  Chi-square statistic: 47.053745
  Degrees of freedom: 6
  p-value: <0.000001
Pseudo R2: 0.267713
R2 (Nagelkerke): 0.409674
R2 (Cox-Snell): 0.303684
Hosmer-Lemeshow test: Chi-square statistic, Degrees of freedom, p-value

The goodness of fit is described by the coefficients $R^2_{Pseudo} = 0.27$, $R^2_{Nagelkerke} = 0.41$ and $R^2_{Cox-Snell} = 0.30$. The sufficient adequacy is also i
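As a cross-check of the report above, the likelihood ratio statistic and the pseudo R2 can be recomputed from the two log-likelihoods alone. The sketch below assumes that the printed log-likelihoods are negative (the minus signs appear to have been lost in extraction) and that "Pseudo R2" is the ratio-based measure 1 - lnL_full / lnL_intercept; the function name is hypothetical.

```python
def likelihood_ratio_summary(loglik_full, loglik_intercept):
    """Return (chi-square statistic, pseudo R2) from two log-likelihoods.

    Assumption: both arguments are natural-log likelihoods (values <= 0).
    """
    # Likelihood ratio statistic: difference of the two -2 log-likelihoods.
    chi_square = (-2 * loglik_intercept) - (-2 * loglik_full)
    # Pseudo R2 understood here as 1 - lnL_full / lnL_intercept.
    pseudo_r2 = 1 - loglik_full / loglik_intercept
    return chi_square, pseudo_r2


if __name__ == "__main__":
    chi_square, pseudo_r2 = likelihood_ratio_summary(-64.354117, -87.88099)
    print(round(chi_square, 6), round(pseudo_r2, 6))
```

With the values from the report, the result agrees with the printed chi-square statistic and pseudo R2 up to rounding.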
245. in which the flat is placed: ca. 3 years old
• Proximity of the district (the time it takes to get to the center): ca. 30 minutes
• Proximity of a bus or tram stop: ca. 80 m

[Table: the wanted flat and Flats 10, 12, 17, 35, 88, 101, 105, 122, 130, 132 and 135, described by the number of rooms, the floor on which the flat is located, the age of the building, the distance of the district from the center, and the proximity of a bus or tram stop]

Let us note that the last characteristic, i.e. the proximity of a bus or tram stop, is expressed in much greater numbers than the remaining characteristics of the compared flats. As a result, that characteristic will have a much greater influence on the resulting distance matrix than the remaining characteristics. In order to prevent this, before the analysis we will normalize all characteristics by choosing a common range for them: from 0 to 1. For that purpose we will use the menu Data → Normalization/Standardization. In the normalization window we set Number of rooms as the input variable and the empty variable called Norm(Number of rooms) as the output variable; the type of the normalization is min/max normalization; the min and max values are calculated from the sample by selecting the button Calculate from sample; the result of the normalization will be returned to the datasheet after selecting the button Run. The normalization is repeated for the foll
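The min/max normalization described above maps each characteristic onto the range from 0 to 1 using the minimum and maximum found in the sample. A minimal sketch, with a hypothetical function name and invented input values:

```python
def min_max_normalize(values):
    """Rescale values to the range 0..1 using the sample minimum and maximum."""
    lowest, highest = min(values), max(values)
    if highest == lowest:
        # A constant column carries no spread; map every value to 0.
        return [0.0 for _ in values]
    return [(v - lowest) / (highest - lowest) for v in values]


if __name__ == "__main__":
    # Hypothetical distances to a bus or tram stop, in metres.
    print(min_max_normalize([20, 80, 50]))
```

After this step a characteristic measured in tens of metres and one measured in single rooms contribute to the distance matrix on the same scale.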
246. int coordinates should be interpreted as standardized values, i.e. positive coordinates point to a value higher than the mean value of the principal component, negative ones to a lower value, and the higher the absolute value, the further the points are from the mean. If there are untypical observations on the graph, i.e. outliers, they can disturb the analysis and should be removed, and the analysis should be made again.

The distances between the points show the similarity of cases: the closer (in the meaning of Euclidean distance) they are to one another, the more similar the information carried by the compared cases.

Orthographic projections of points on vectors are interpreted in the same manner as point coordinates, i.e. projections onto axes, but the interpretation concerns original variables and not principal components. The values placed at the end of a vector are greater than the mean value of the original variable, and the values placed on the extension of the vector, but in the opposite direction, are values smaller than the mean.

18.1.3 The criteria of dimension reduction

There is no single universal criterion for the selection of the number of principal components. For that reason it is recommended to make the selection with the help of several methods.

The percentage of explained variance
The number of principal components t
247. ion is possible on the basis of the constructed model. The suggested cut-off point for the ROC curve is 0.60 and is slightly higher than the standard level used in regression, i.e. 0.5. Classification made on the basis of that cut-off point yields 78.46% correctly classified cases, of which the correctly classified yes values constitute 77.92% (sensitivity, 95% CI: 67.02%, 86.58%) and the no values constitute 79.25% (specificity, 95% CI: 65.89%, 89.16%).

We can finish the analysis of classification at this stage or, if the result is not satisfactory, we can make a more detailed analysis of the ROC curve in the module ROC curve.

As we have assumed that classification on the basis of that model is satisfactory, we can calculate the predicted value of the dependent variable for any conditions. Let us check the chances of solving the task for a person with: ADDRESSOFRES = 1 (city), SEX = 1 (female), AGE = 50 years, EDUCATION = 1 (primary), TIME needed for the completion of the task = 20 minutes, DISTURBANCES = 1 (yes).

For that purpose, on the basis of the values of the coefficients b, we calculate the predicted probability (probability of receiving the answer yes on cond
248. ion is taken into account, then, just as in the logistic regression model, we can calculate the appropriate HR on the basis of the general formula, which is a combination of independent variables.

19.4.2 Model verification

Statistical significance of particular variables in the model (significance of the hazard ratio)
On the basis of the coefficient and its error of estimation we can infer whether the independent variable for which the coefficient was estimated has a significant effect on the dependent variable. For that purpose we use the Wald test.

Hypotheses:
$H_0: \beta_i = 0$, $H_1: \beta_i \neq 0$,
or, equivalently:
$H_0: HR_i = 1$, $H_1: HR_i \neq 1$.

The Wald test statistic is calculated according to the formula:
$\chi^2 = \left( \frac{\beta_i}{SE_{\beta_i}} \right)^2$

The statistic asymptotically (for large sizes) has the χ² distribution with 1 degree of freedom. On the basis of the test statistic, the p-value is estimated and then compared with the significance level α:
if p ≤ α → we reject H0 and accept H1,
if p > α → there is no reason to reject H0.

The quality of the constructed model
A good model should fulfil two basic conditions: it should fit well and be as simple as possible. The quality of the Cox proportional hazards model can be evaluated with a few general measures based on:
$L_{FM}$: the maximum value of the likelihood function of a full model (with all variables),
$L_0$: the maximum value of the likelihood function of a model which only
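The Wald test described above needs only a coefficient and its standard error. The sketch below uses the fact that a χ² variable with 1 degree of freedom is the square of a standard normal variable, so the p-value can be obtained from the complementary error function in the standard library. The function name and the input values are illustrative assumptions.

```python
import math


def wald_test(coefficient, standard_error):
    """Wald chi-square statistic (1 degree of freedom) and its p-value."""
    chi_square = (coefficient / standard_error) ** 2
    # For 1 degree of freedom: P(chi2 > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi_square / 2))
    return chi_square, p_value


if __name__ == "__main__":
    alpha = 0.05
    # Hypothetical coefficient and standard error.
    chi_square, p_value = wald_test(0.8, 0.3)
    decision = "reject H0" if p_value <= alpha else "no reason to reject H0"
    print(chi_square, p_value, decision)
```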
249. ion onto the original value sepal width) but small values of the remaining original values (negative values on the projection onto the extension of the vectors illustrating the remaining original values).

19 SURVIVAL ANALYSIS

Survival analysis is often used in medicine. In other fields of study it is also called reliability analysis, duration analysis, or event history analysis. Its main goal is to evaluate the remaining survival time of, for example, patients after an operation. The tools used in the analysis are life tables and Kaplan-Meier curves. Another interesting aspect of the issue is comparing the survival time of, for example, patients treated according to different protocols. For that purpose, comparisons of two or more survival curves are used. A number of methods (regression models) have also been created for studying the influence of various variables on the survival time. To make the issue easier to understand, the example of the length of life of patients after a heart transplantation will be used to illustrate the basic definitions.

Event: the change interesting to the researcher, e.g. death.
Survival time: the period of time between the initial state and the occurrence of a given event, e.g. the length of a patient's life after a heart transplantation.

Note
In the analysis, one column with the calcu
250. isions and about the a priori prevalence coefficient value provided by the user.
• Optimum cut-off on the ROC curve, computed on the basis of sensitivity, specificity, costs of wrong decisions, and the prevalence coefficient.

Errors which can be made when classifying the studied objects as belonging to group (+) or group (-) are false positive results (FP) and false negative results (FN). If committing those errors is equally costly (ethical, financial, and other costs), then in the field Cost FP and in the field Cost FN we enter the same positive value, usually 1. However, if we come to the conclusion that one type of error carries a greater cost than the other, then we assign an appropriately greater weight to it.

The optimum cut-off value is calculated on the basis of sensitivity, specificity, and the value m: the slope of the tangent line to the ROC curve. The slope m is defined in relation to two values: the costs of wrong decisions and the prevalence coefficient. Normally the costs of wrong decisions have the value 1 and the prevalence coefficient is estimated from the sample. Knowing a priori the prevalence coefficient $P_{apriori}$ and the costs of wrong decisions, the user can influence the value m and, consequently, the search for an optimum cut-off. As a result, the optimum cut-off is determined to be such a value of the diagnostic variable for which the formula Sensitivity - m · (1 - Specificit
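The search for the optimum cut-off can be sketched as follows. The excerpt does not show how m is computed, so the sketch assumes the commonly used definition m = (Cost FP / Cost FN) · (1 - P) / P, where P is the prevalence coefficient, and assumes that the criterion Sensitivity - m · (1 - Specificity) is maximised. Function names and the candidate points are hypothetical.

```python
def tangent_slope(cost_fp, cost_fn, prevalence):
    """Slope m of the tangent line (assumed definition, see lead-in)."""
    return (cost_fp / cost_fn) * (1 - prevalence) / prevalence


def optimum_cutoff(points, m):
    """Pick the cut-off maximising Sensitivity - m * (1 - Specificity).

    points -- iterable of (cutoff, sensitivity, specificity) tuples
    """
    return max(points, key=lambda p: p[1] - m * (1 - p[2]))


if __name__ == "__main__":
    # Hypothetical points on a ROC curve.
    candidates = [(0.3, 0.95, 0.50), (0.6, 0.78, 0.79), (0.8, 0.40, 0.95)]
    m = tangent_slope(cost_fp=1, cost_fn=1, prevalence=0.5)
    print(optimum_cutoff(candidates, m))
```

Raising Cost FN relative to Cost FP lowers m, which shifts the choice toward cut-offs with higher sensitivity.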
251. ition of defining the values of the independent variables):

$P(Y = yes \mid ADDRESSOFRES, SEX, AGE, EDUCATION, TIME, DISTURBANCES) = \frac{e^{Z}}{1 + e^{Z}}$,
where
$Z = 7.23 - 0.45 \cdot ADDRESSOFRES - 0.45 \cdot SEX - 0.1 \cdot AGE + 0.46 \cdot EDUCATION - 0.09 \cdot TIME - 1.92 \cdot DISTURBANCES$.

For the person described above:
$Z = 7.231 - 0.453 \cdot 1 - 0.455 \cdot 1 - 0.101 \cdot 50 + 0.456 \cdot 1 - 0.089 \cdot 20 - 1.924 \cdot 1$.

As a result of the calculation the program will return the predicted probability 0.121512 for the cut-off 0.6.

The obtained probability of solving the task is equal to 0.1215, so, on the basis of the cut-off 0.60, the predicted result is 0, which means the task was not solved correctly.

17.5 COMPARISON OF LOGISTIC REGRESSION MODELS

The window with settings for model comparison is accessed via the menu Statistics → Multidimensional models → Logistic regression - comparing models.
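The prediction step can be reproduced with a short script. The coefficient magnitudes are those printed in the example; their signs are not legible in this excerpt, so the signs used below are an assumption, chosen because they reproduce the reported probability of about 0.1215. The dictionary and function names are hypothetical.

```python
import math

# Coefficient magnitudes from the example; SIGNS ARE ASSUMED (see lead-in).
COEFFICIENTS = {
    "ADDRESSOFRES": -0.453,
    "SEX": -0.455,
    "AGE": -0.101,
    "EDUCATION": 0.456,
    "TIME": -0.089,
    "DISTURBANCES": -1.924,
}
INTERCEPT = 7.231


def predicted_probability(person):
    """Logistic model: P = e^Z / (1 + e^Z), Z = intercept + sum(b_i * x_i)."""
    z = INTERCEPT + sum(COEFFICIENTS[name] * value for name, value in person.items())
    return 1 / (1 + math.exp(-z))


if __name__ == "__main__":
    person = {"ADDRESSOFRES": 1, "SEX": 1, "AGE": 50,
              "EDUCATION": 1, "TIME": 20, "DISTURBANCES": 1}
    p = predicted_probability(person)
    cutoff = 0.60
    print(p, 1 if p >= cutoff else 0)
```

Because the coefficients are rounded to three decimals, the computed probability matches the reported value only approximately.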
252. ity correction is defined by:

$Z = \frac{\left| T - \frac{n(n+1)}{4} \right| - 0.5}{\sqrt{\frac{n(n+1)(2n+1)}{24} - \frac{\sum (t^3 - t)}{48}}}$

The settings window with the Wilcoxon test (signed-ranks) can be opened in the Statistics menu → NonParametric tests (ordered categories) → Wilcoxon (signed-ranks) or in the Wizard.

Example 10.1 (cont.) (courier.pqs file)
Hypotheses:
H0: the median of the number of awaiting days for the delivery which is supposed to be delivered by the analysed courier company is 3,
H1: the median of the number of awaiting days for the delivery which is supposed to be delivered by the analysed courier company is different from 3.

Analysis time: 0.03 sec
Analysed variables: waiting days
Significance level: 0.05
Continuity correction: Yes
p-value (exact): 0.123212
Z statistic (adjusted for ties): 1.572575
p-value (asymptotic): 0.115817
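The asymptotic part of the report can be checked numerically. The sketch below assumes the standard form of the signed-ranks Z statistic (with the tie term and an optional continuity correction of 0.5) and a two-sided p-value from the normal distribution; function names and the first set of inputs are hypothetical.

```python
import math


def wilcoxon_z(t_stat, n, tie_sizes=(), continuity=True):
    """Asymptotic Z for the Wilcoxon signed-ranks statistic T.

    t_stat    -- sum of positive (or negative) ranks
    n         -- number of non-zero differences
    tie_sizes -- sizes t of the groups of tied ranks (assumed input layout)
    """
    mean = n * (n + 1) / 4
    variance = n * (n + 1) * (2 * n + 1) / 24 - sum(t**3 - t for t in tie_sizes) / 48
    distance = abs(t_stat - mean)
    if continuity:
        distance = max(distance - 0.5, 0.0)
    return distance / math.sqrt(variance)


def two_sided_p(z):
    """Two-sided p-value for a standard normal statistic."""
    return math.erfc(abs(z) / math.sqrt(2))


if __name__ == "__main__":
    print(wilcoxon_z(10, 10))
    # Z from the report above; the result should agree with the asymptotic p-value.
    print(two_sided_p(1.572575))
```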
253. jian-Tilaki K.O. (1997), Sampling variability of nonparametric estimates of the areas under receiver operating characteristic curves: an update. Academic Radiology 4(1), 49-58
[39] Hanley J.A. and McNeil B.J. (1982), The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology 143(1), 29-36
[40] Hanley J.A. and McNeil B.J. (1983), A method of comparing the areas under receiver operating characteristic curves derived from the same cases. Radiology 148, 839-843
[41] Kaplan E.L., Meier P. (1958), Nonparametric estimation from incomplete observations. Journal of the American Statistical Association 53, 457-481
[42] Kendall M.G. (1938), A new measure of rank correlation. Biometrika 30, 81-93
[43] Kendall M.G., Babington-Smith B. (1939), The problem of m rankings. Annals of Mathematical Statistics 10, 275-287
[44] Kleinbaum D.G., Klein M. (2005), Survival Analysis: A Self-Learning Text, Second Edition (Statistics for Biology and Health)
[45] Kolmogorov A.N. (1933), Sulla determinazione empirica di una legge di distribuzione. Giornale dell'Istituto Italiano degli Attuari 4, 89-91
[46] Kruskal W.H. (1952), A nonparametric test for the several sample problem. Annals of Mathematical Statistics 23, 525-540
[47] Kruskal W.H., Wallis W.A. (1952), Use of ranks in one-criterion variance analysis. Journal of the American Statistical Association
254. l of the National Cancer Institute 22, 719-748
[57] Mantel N. (1963), Chi-square tests with one degree of freedom: Extensions of the Mantel-Haenszel procedure. J. Am. Statist. Assoc. 58, 690-700
[58] Mantel N. (1966), Evaluation of Survival Data and Two New Rank Order Statistics Arising in Its Consideration. Cancer Chemotherapy Reports 50, 163-170
[59] Marascuilo L.A. and McSweeney M. (1977), Nonparametric and distribution-free methods for the social sciences. Monterey, CA: Brooks/Cole Publishing Company
[60] Marascuilo L.A. and McSweeney M. (1977), Nonparametric and distribution-free methods for the social sciences. Monterey, CA: Brooks/Cole Publishing Company
[61] McNemar Q. (1947), Note on the sampling error of the difference between correlated proportions or percentages. Psychometrika 12, 153-157
[62] Mehta C.R. and Patel N.R. (1986), Algorithm 643. FEXACT: A Fortran subroutine for Fisher's exact test on unordered r*c contingency tables. ACM Transactions on Mathematical Software 12, 154-161
[63] Miettinen O.S. (1985), Theoretical Epidemiology: Principles of Occurrence Research in Medicine. John Wiley and Sons, New York
[64] Miettinen O.S. and Nurminen M. (1985), Comparative analysis of two rates. Statistics in Medicine 4, 213-226
[65] Newcombe R.G. (1998), Interval Estimation for the Difference Between Independent Proportions: Comparison of Eleve
255. CI] = 2.57 [1.24, 5.34]) and men (OR [95% CI] = 1.71 [0.78, 3.76]). The tables are homogeneous (p = 0.465049). Thus we can use the calculated odds ratio, which is common to both tables ($OR_{MH}$ [95% CI] = 2.13 [1.24, 3.65]). Finally, the obtained result indicates that the odds of the occurrence of leptospirosis antibodies are significantly greater among village inhabitants (p = 0.005169).

13.1.2 The Mantel-Haenszel relative risk

If all tables created by individual strata are homogeneous (the χ² test of homogeneity for the RR can check this condition), then on the basis of these tables the pooled relative risk with its confidence interval can be designated. Such a relative risk is a weighted mean of the relative risks designated for the individual strata. The weighting method proposed by Mantel and Haenszel takes the contribution of the strata weights into account. Each stratum of the input has an influence on the construction of the pooled relative risk: the greater the size of the stratum, the greater its weight and the greater its influence on the pooled relative risk.

Weights for individual strata are designated according to the following formula:
$S^{(s)} = \frac{O_{21}^{(s)} \left( O_{11}^{(s)} + O_{12}^{(s)} \right)}{n^{(s)}}$
and the Mantel-Haenszel relative risk:
$RR_{MH} = \frac{R}{S}$,
where
$R^{(s)} = \frac{O_{11}^{(s)} \left( O_{21}^{(s)} + O_{22}^{(s)} \right)}{n^{(s)}}$, $R = \sum_{s} R^{(s)}$, $S = \sum_{s} S^{(s)}$.

The confidence interval for $\log RR_{MH}$ is designated on the basis of the standard error calculated according to the following formula:
$SE_{MH} = \sqrt{\frac{V}{R \cdot S}}$,
where $V = \sum_{s} V^{(s)}$
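The pooling described above can be sketched as follows. The per-stratum variance term is not legible in this excerpt, so the sketch assumes the commonly used Greenland-Robins term; the cell layout (row 1 = exposed, row 2 = unexposed, column 1 = cases), the function name and the example counts are likewise assumptions.

```python
import math


def mantel_haenszel_rr(strata):
    """Pooled relative risk and standard error of its logarithm.

    strata -- list of 2x2 tables given as (o11, o12, o21, o22), where
              o11/o12 are exposed cases/non-cases and o21/o22 are
              unexposed cases/non-cases (assumed layout).
    """
    r_sum = s_sum = v_sum = 0.0
    for o11, o12, o21, o22 in strata:
        n = o11 + o12 + o21 + o22
        r_sum += o11 * (o21 + o22) / n
        s_sum += o21 * (o11 + o12) / n  # stratum weight
        # Assumed variance term (Greenland-Robins form).
        v_sum += ((o11 + o12) * (o21 + o22) * (o11 + o21) - o11 * o21 * n) / n**2
    rr = r_sum / s_sum
    se_log_rr = math.sqrt(v_sum / (r_sum * s_sum))
    return rr, se_log_rr


if __name__ == "__main__":
    # Two hypothetical strata.
    rr, se = mantel_haenszel_rr([(10, 90, 5, 95), (20, 80, 10, 90)])
    low, high = rr * math.exp(-1.96 * se), rr * math.exp(1.96 * se)
    print(rr, se, (low, high))
```

The confidence interval is built on the logarithmic scale and then transformed back, which is why it is not symmetric around the pooled relative risk.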
256. lained by the variability of the four initial principal components.

18.1.2 Graphical interpretation

A lot of the information carried by the coefficients returned in the tables can be presented on one chart. The ability to read charts allows a quick interpretation of many aspects of the conducted analysis. The charts gather in one place the information concerning the mutual relationships among the components, the original variables, and the cases. They give a general picture of the principal components analysis, which makes them a very good summary of it.

Factor loadings graph
The graph shows vectors, starting at the origin of the coordinate system, which represent the original variables. The vectors are placed on a plane defined by the two selected principal components.

[Graph: factor loadings, factor 1 vs. factor 2]

The coordinates of the terminal points of the vectors are the corresponding factor loadings of the variables.
Vector length represents the information content of an original variable carried by the principal components which define the coordinate system. The longer the vector, the greater the contribution of the original variable to the components. In the case of an analysis based on a correlation matrix, the loadings are correlations between original variables and principal components. In such a case points fall into the unit circle. It happens because the correlation coefficient cannot exceed
257. large expected frequencies) has the χ² distribution with 1 degree of freedom. The p-value, designated on the basis of the test statistic, is compared with the significance level α.

The settings window with the Chi-square test, OR/RR (2x2) can be opened in the Statistics menu → NonParametric tests (unordered categories) → Chi-square, OR/RR (2x2) or in the Wizard.

EXAMPLE 11.7 (sex-exam.pqs file)
There is a sample consisting of 170 persons (n = 170). Using this sample, you want to analyse 2 features (X = sex, Y = exam passing). Each of these features occurs in two categories (X1 = f, X2 = m, Y1 = yes, Y2 = no). Based on the sample, you want to find out whether there is any dependence between sex and exam passing in the above population. The data distribution is presented in the contingency table below:

Observed frequencies O_ij
              exam passing
sex           yes     no      total
f             50      40      90
m             20      60      80
total         70      100     170

Hypotheses:
H0: there is no dependence between sex and exam passing in the analysed population,
H1: there is a dependence between sex and exam passing in the analysed population.

Report: analysis time, analysed variables, significance level, odds ratio with its 95% CI, Stati
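For the table in this example, the odds ratio and the χ² statistic can be recomputed directly from the four cells. The sketch below uses the uncorrected Pearson χ² for a 2x2 table and the 1-degree-of-freedom p-value via the complementary error function; whether the original report applies a correction is not visible in this excerpt, so that choice is an assumption, as are the function names.

```python
import math


def odds_ratio(a, b, c, d):
    """Odds ratio for a 2x2 table [[a, b], [c, d]]."""
    return (a * d) / (b * c)


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) and p-value, 1 df."""
    n = a + b + c + d
    statistic = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


if __name__ == "__main__":
    # Cells from the example: f/yes, f/no, m/yes, m/no.
    print(odds_ratio(50, 40, 20, 60))
    print(chi_square_2x2(50, 40, 20, 60))
```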
258. lated time ought to be marked. When we have at our disposal two points in time, the initial and the final one, then before the analysis we calculate the time between the two points using the datasheet formulas.

Censored observations are the observations for which we only have incomplete information about the survival time.

Censored and complete observations: an example concerning the survival time after a heart transplantation:
• a complete observation: we know the date of the transplantation and the date of the patient's death, so we can establish the exact survival time after the transplantation;
• an observation censored on the right side: the date of the patient's death is not known (the patient is alive when the study finishes), so the exact survival time cannot be established;
• an observation censored on the left side: the date of the heart transplantation is not known, but we know it was before this study started, and we cannot establish the exact survival time.

[Figure: a complete observation, an observation censored on the right side, and an observation censored on the left side, shown on a time axis between the beginning and the end of the study]

Note
The end of the study means the end of the observation of the patient. It is not always the same moment for all patients. It can be the moment of losing touch with the patient (so we do not know the pat
259.

Note
Calculations can be based on raw data or on averaged data, such as the arithmetic mean of the differences, the standard deviation of the differences, and the sample size.

11.2 NONPARAMETRIC TESTS

11.2.1 The Mann-Whitney U test

The Mann-Whitney U test is also called the Wilcoxon-Mann-Whitney test (Mann and Whitney (1947)[55] and Wilcoxon (1949)[85]). This test is used to verify the hypothesis that the differences between the medians of an analysed variable in 2 populations are insignificant, under the assumption that the distributions of the variable are fairly similar to each other.

Basic assumptions:
- measurement on an ordinal scale or on an interval scale,
- an independent model.

Hypotheses:
$H_0: \theta_1 = \theta_2$,
$H_1: \theta_1 \neq \theta_2$,
where $\theta_1, \theta_2$ are the medians of the analysed variable in the 1st and the 2nd population.

The p-value, designated on the basis of the test statistic, is compared with the significance level α:
if p ≤ α → reject H0 and accept H1,
if p > α → there is no reason to reject H0.

Note
Depending on the sample size, the test statistic is calculate
260. le yet, but a change in the structure of tissues.

Observed frequencies
                                 Reality (histopathology)
                                 disease    free    total
mammography: positive result        9        10       19
mammography: negative result        1       230      231
total                              10       240      250

We will calculate the values enabling the assessment of the performed diagnostic test.

Analysis time: 0.50 sec
Data: contingency table
Specificity: 0.958333 (95% CI: 0.92471, 0.979841)
Positive predictive value (PPV): 0.473684 (95% CI: 0.244475, 0.711357)
Negative predictive value (NPV): 0.995671 (lower limit of the 95% CI: 0.976118)
Positive likelihood ratio (PLR): upper limit of the 95% CI: 41.003106
Negative likelihood ratio (NLR): 0.104348 (95% CI: 0.016251, 0.670016)
Accuracy (ACC): 0.956 (95% CI: 0.922637, 0.977834)
Prevalence: 0.04 (95% CI: 0.019345, 0.072329)

[Graph: sensitivity against 1 - specificity]

• 90% of women suffering from breast cancer have been correctly defined, so they have obtained the positive result of mammography;
• 95.83% of healthy women (not suffering from breast cancer) have been c
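The point estimates in this example follow directly from the four cells of the table. A minimal sketch (the function name and the dictionary keys are hypothetical; confidence intervals are not computed here):

```python
def diagnostic_measures(tp, fp, fn, tn):
    """Basic measures of a diagnostic test from a 2x2 table.

    tp -- positive test, disease present    fp -- positive test, disease absent
    fn -- negative test, disease present    tn -- negative test, disease absent
    """
    n = tp + fp + fn + tn
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "accuracy": (tp + tn) / n,
        "prevalence": (tp + fn) / n,
    }


if __name__ == "__main__":
    # Cells from the mammography example above.
    for name, value in diagnostic_measures(tp=9, fp=10, fn=1, tn=230).items():
        print(f"{name}: {value:.6f}")
```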
261. lmogorov-Smirnov or Lilliefors test; single-sample t-test. Nominal scale: χ² test (goodness-of-fit), tests for one proportion.

10 COMPARISON - 1 GROUP

10.1 PARAMETRIC TESTS

10.1.1 The t-test for a single sample

The single-sample t-test is used to verify the hypothesis that an analysed sample with the mean $\bar{x}$ comes from a population where the mean $\mu$ is a given value.

Basic assumptions:
- measurement on an interval scale,
- normality of distribution of an analysed feature.

Hypotheses:
$H_0: \mu = \mu_0$,
$H_1: \mu \neq \mu_0$,
where:
$\mu$ is the mean of an analysed feature of the population represented by the sample,
$\mu_0$ is a given value.

The test statistic is defined by:
$t = \frac{\bar{x} - \mu_0}{sd} \sqrt{n}$,
where:
$sd$ is the standard deviation from the sample,
$n$ is the sample size.

The test statistic has the t-Student distribution with n - 1 degrees of freedom.
The p-value, designated on the basis of the test statistic, is compared with the significance level α:
if p ≤ α → reject H0 and accept H1,
if p > α → there is no reason to reject H0.

Note
If the sample is large and you know the standard deviation of the population, then you can calculate the test statistic using the formula:
$Z = \frac{\bar{x} - \mu_0}{\sigma} \sqrt{n}$
The statistic calculated this way has the normal distribution. If $n \to \infty$, the t-Student distribution converges to the normal distribution N(0, 1). In pra
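The test statistic above is simple to compute by hand or in code. The sketch below returns only the statistic and the degrees of freedom; obtaining the p-value requires the t-Student distribution, which the Python standard library does not provide, so that step is left out. The function name and the sample are illustrative.

```python
import math
import statistics


def single_sample_t(sample, mu0):
    """t statistic and degrees of freedom for the single-sample t-test."""
    n = len(sample)
    mean = statistics.mean(sample)
    sd = statistics.stdev(sample)  # sample standard deviation (n - 1 in the denominator)
    t = (mean - mu0) / sd * math.sqrt(n)
    return t, n - 1


if __name__ == "__main__":
    # Hypothetical sample and hypothesised mean.
    print(single_sample_t([2, 4, 4, 4, 5, 5, 7, 9], mu0=3))
```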
262. lopper-Pearson method for a single proportion.

Negative predictive value (NPV): the probability that a person having a negative test result did not suffer from any disease. If the examined person obtains a negative test result, the NPV informs them how sure they can be that they do not suffer from a particular disease.
$NPV = \frac{TN}{FN + TN}$
The confidence interval is built on the basis of the Clopper-Pearson method for a single proportion.

Positive and negative predictive values depend on the prevalence rate.

Prevalence: the probability of disease in the population for which the diagnostic test was conducted.
$prevalence = \frac{TP + FN}{n}$
The confidence interval is built on the basis of the Clopper-Pearson method for a single proportion.

• Likelihood ratio of positive test and likelihood ratio of negative test

Likelihood ratio of positive test ($LR_+$): this measurement enables the comparison of several test results matching the gold standard. It does not depend on the prevalence of the disease. It is the ratio of two odds: the odds that a person from the group of ill people will obtain a positive test result, and the odds that the same effect will be observed among healthy people.
$LR_+ = \frac{sensitivity}{1 - specificity} = \frac{TP / (TP + FN)}{FP / (FP + TN)}$
The confidence interval for $LR_+$ is built on the basis of the standard error:
$SE = \sqrt{\frac{1 - sensitivity}{TP} + \frac{specificity}{FP}}$

Likelihood ratio of negative test ($LR_-$): it is the
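A sketch of the likelihood ratios with confidence intervals follows. The standard error for $LR_+$ is the one given in the text; the standard error used for $LR_-$ is the analogous standard formula and is an assumption, as the definition of $LR_-$ is cut off in this excerpt. Intervals are built on the logarithmic scale. The function name is hypothetical; the example counts are TP = 9, FP = 10, FN = 1, TN = 230.

```python
import math
from statistics import NormalDist


def likelihood_ratios(tp, fp, fn, tn, level=0.95):
    """LR+ and LR- with confidence intervals built on the log scale."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    z = NormalDist().inv_cdf(0.5 + level / 2)

    lr_pos = sensitivity / (1 - specificity)
    se_pos = math.sqrt((1 - sensitivity) / tp + specificity / fp)

    lr_neg = (1 - sensitivity) / specificity
    # Assumed analogue of the standard error above, for the negative ratio.
    se_neg = math.sqrt(sensitivity / fn + (1 - specificity) / tn)

    def interval(lr, se):
        return lr * math.exp(-z * se), lr * math.exp(z * se)

    return {"LR+": (lr_pos, *interval(lr_pos, se_pos)),
            "LR-": (lr_neg, *interval(lr_neg, se_neg))}


if __name__ == "__main__":
    for name, (value, low, high) in likelihood_ratios(9, 10, 1, 230).items():
        print(f"{name}: {value:.6f} [{low:.6f}, {high:.6f}]")
```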
263. lue (asymptotic): 0.000229

[Graph: pain before and pain after (median, min-max)]

Comparing the p-value = 0.0001 of the Wilcoxon test, based on the T statistic, with the significance level α = 0.05, you conclude that there is a statistically significant difference in the level of felt pain between these 2 examinations. The difference is that the level of pain decreased (the sum of the negative ranks is significantly greater than the sum of the positive ranks). You would make exactly the same decision on the basis of the p-value = 0.00021 or the p-value = 0.00023 of the Wilcoxon test based on the Z statistic or on the Z statistic with the continuity correction.

11.2.3 TESTS FOR CONTINGENCY TABLES

Tests for contingency tables can be calculated on the basis of data gathered as contingency tables or in the form of raw data. It is also possible to transform the data from a contingency table to the raw form, or inversely.

In the PQStat application there is a group of tests which can be used with either form of data. These are:
- the χ² test for the trend for R x 2 tables,
- the χ² test and the Fisher test for R x C tables,
- the χ² test and the Fisher test for 2 x 2 tables and their corrections,
- the McNemar test, the Bowker test of internal symmetry,
- the test of significance for Cohen's Kappa.

EXAMPL
264. ly different values. The values can be predicted on the basis of the column for which the missing data are being replaced, or on the basis of the values of other columns (variables). The missing data can be replaced with the following types of values:
• random values from the dataset;
• random values from the normal distribution, defined on the basis of the mean and the standard deviation from the existing data;
• random values from a range given by the user;
• values calculated from the user's functions, which allows the use of data from other variables to predict the missing value in the selected column;
• values calculated from the regression model, which allows predicting the values of the missing data on the basis of a multiple regression model (the manner in which multiple regression operates is described in the chapter Multiple linear regression);
• interpolation on the basis of the neighbouring values: it applies to time series, so the user must point to the time variable, which gives information about the data order; the interpolation consists in determining the value for the missing data in such a manner that they are placed graphically on a straight line joining the values of the data neighbouring the missing data;
• the mean from the n of the neighbours: it applies to time series, so the user must point to the tim
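The interpolation option described above can be sketched as follows: each missing value lying between two known neighbours is placed on the straight line joining them, using the time variable to measure the position. This is an illustration of the idea only; the function name, the use of `None` for missing data and the assumption that the data are already sorted by time are choices made for the example.

```python
def interpolate_missing(times, values):
    """Fill None entries that lie between two known values by linear interpolation.

    times  -- time variable giving the order of the data (assumed sorted)
    values -- observations, with None marking missing data
    """
    filled = list(values)
    known = [i for i, v in enumerate(values) if v is not None]
    for left, right in zip(known, known[1:]):
        for i in range(left + 1, right):
            # Position of the missing point between its two known neighbours.
            share = (times[i] - times[left]) / (times[right] - times[left])
            filled[i] = values[left] + share * (values[right] - values[left])
    return filled


if __name__ == "__main__":
    print(interpolate_missing([1, 2, 3, 4, 6], [10, None, None, 16, 20]))
```

Missing values before the first or after the last known observation have no pair of neighbours and are left unchanged by this sketch.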
265. m the average wages in the country, assuming that the other variables in the model remain unchanged.

17.1.2 Interactions

Interactions are considered in multidimensional models. Their presence means that the influence of the independent variable X1 on the dependent variable Y differs depending on the level of another independent variable X2 or a series of other independent variables. To discuss the interactions in multidimensional models one must determine the variables informing about possible interactions, i.e. the product of appropriate variables. For that purpose we select the Interactions button in the window of the selected multidimensional analysis. In the window of interactions settings, with the CTRL button pressed, we determine the variables which are to form interactions and transfer the variables into the neighbouring list with the use of an arrow. By pressing the OK button we will obtain appropriate columns in the datasheet.

In the analysis of interactions, the choice of appropriate coding of dichotomous variables allows the avoidance of the over-parametrization related to interactions. Over-parametrization causes the effects of the lower order for dichotomous variables to be redundant with respect to the confounding interactions of the higher order. As a result, the inclusion of the interactions of the higher order in the model annuls the effect of the interactions of the lower orders, not allowing an appropriate evaluation of the
266. me feature and on the same objects, then the odds ratio for the result change from (+) to (−) and inversely is calculated for the table. The odds for the result change from (+) to (−) is $O_{12}$, and the odds for the result change from (−) to (+) is $O_{21}$. The odds ratio (OR) is $OR = \frac{O_{12}}{O_{21}}$. The confidence interval for the odds ratio is calculated on the basis of the standard error $SE = \sqrt{\frac{1}{O_{12}} + \frac{1}{O_{21}}}$. The settings window with the Bowker-McNemar test can be opened in Statistics menu → NonParametric tests (unordered categories) → Bowker-McNemar or in Wizard. [Screenshot: Bowker-McNemar test settings window with variable selection, contingency table or raw data input, continuity correction, data filter and report options.] The Bowker test of internal symmetry. The Bowker test of internal symmetry (Bowker, 1948 [11]) is an extension of the McNemar test for 2 variables with more than 2 categories ($c > 2$). It is used to verify the hypothesis determining the symmetry of 2 results of measurements executed twice, $X^{(1)}$ and $X^{(2)}$, of the $X$ feature (symmetry of 2 dependent variables $X^{(1)}$ and $X^{(2)}$). An analysed feature may have more than 2 categories. The
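The odds ratio for the discordant cells and its standard error, as described above, can be sketched in Python. This is only an illustration: the counts are hypothetical, the function name is made up for the example, and building the interval on the logarithmic scale is an assumption, since the text states only that the interval is based on the standard error.

```python
import math

def mcnemar_odds_ratio(o12, o21, z=1.959964):
    """Odds ratio of the two discordant counts with an approximate 95% interval."""
    odds_ratio = o12 / o21
    # Standard error as given in the text: sqrt(1/O12 + 1/O21).
    se = math.sqrt(1 / o12 + 1 / o21)
    # Assumption: the interval is computed for ln(OR) and transformed back.
    lower = math.exp(math.log(odds_ratio) - z * se)
    upper = math.exp(math.log(odds_ratio) + z * se)
    return odds_ratio, se, lower, upper

# Hypothetical discordant counts, not taken from the guide.
print(mcnemar_odds_ratio(22, 2))
```

Both discordant counts must be greater than zero for the ratio and the standard error to be defined.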
267. mp 1a negative intraclass coefficient is treated in the same ways as rrcc amp 0 e rioc amp Oa lack of an absolute concordance in individual objects assessments made by judges it is visible in a small variance between objects a small means difference between objects and in a large variance between judges assessments a significant means difference of assessments designated by k judges In addition an average intraclass correlation coefficient can be formulated as k ICC TE OC If we averaged these two judges assessments and used them as a one result the coefficient would not be directly related to the problem but to the reliability of the situation results The F test of significance for the intraclass correlation coefficient Basic assumptions measurement on an interval scale Copyright 2010 2014 PQStat Software All rights reserved 195 15 AGREEMENT ANALYSIS the normal distribution for all variables which are the differences of measurement pairs or the normal distribution for an analysed variable in each measurement Hypotheses Ho Rioc O0 Hi Ricco 0 Ricc 1 The test statistic is defined by _ MSps E M Sres This statistic has the F Snedecor distribution with dfgs n 1 and dfres n 1 k 1 degrees of freedom F The p value designated on the basis of the test statistic is compared with the significance level a fp lt a gt reject Ho and accept H fp gt
268. mpare the p value calculated for the McNemar test (p value < 0.000001) with the significance level α = 0.05, you draw the conclusion that the students changed their opinions. There were many more students who changed their opinions to negative ones after the exam than those who changed their opinions to positive ones. The possibility of changing the opinion from positive (before the exam) to negative (after the exam) is eleven times greater than from negative to positive (the chance to change the opinion in the opposite direction is $\frac{1}{11}$ = 0.090909). 11.2.10 Z Test for two dependent proportions. The Z test for two dependent proportions is used in situations similar to the McNemar test, i.e. when we have 2 dependent groups of measurements ($X^{(1)}$ and $X^{(2)}$) in which we can obtain 2 possible results of the studied feature. [Table: observed sizes.] We can also calculate the distinguished proportions $p_1$ and $p_2$ for those groups [defining formulas not legible in the source]. The test serves the purpose of verifying the hypothesis that the distinguished proportions $P_1$ and $P_2$ in the population from which the sample was drawn are equal. Basic assumptions: measurement on the nominal, ordinal or interval scale; dependent model; large sample size. Hypotheses: $H_0: P_1 - P_2 = 0$, $H_1: P_1 - P_2 \neq 0$, where $P_1$, $P_2$ are the fractions for the first and the second measurement
269. my variable for sex: the statistically significant coefficient b = −0.5, which means that the average women's wages are half a thousand PLN lower than the average wages in the country, assuming that the other variables in the model remain unchanged; for the western region: the statistically significant coefficient b = 0.6, which means that the average wages of people living in the western region of the country are 0.6 thousand PLN higher than the average wages in the country, assuming that the other variables in the model remain unchanged; for the eastern region: the statistically significant coefficient b = −1, which means that the average wages of people living in the eastern region of the country are a thousand PLN lower than the average wages in the country, assuming that the other variables in the model remain unchanged; for the northern region: the statistically significant coefficient b = 0.4, which means that the average wages of people living in the northern region of the country are 0.4 thousand PLN higher than the average wages in the country, assuming that the other variables in the model remain unchanged; for the southern region: the statistically non-significant coefficient b = 0.1, which means that the average wages of people living in the southern region of the country do not differ in a statistically significant manner fro
270. n y TE 2 If we want to extend the transformed data to a different range, we ought to enter the limits of the new range in the Normalization/Standardization window. Normalizing function with a coefficient. This normalization reduces the data to the range from −1 to 1 with the use of an S-shaped function with the changing normalization coefficient α: $f(x) = \frac{x}{\sqrt{x^2 + \alpha}}$. When the value of the α coefficient is raised, a graph with a less steep slope is formed. If we want to extend the transformed data to a different range, we ought to enter the limits of the new range in the Normalization/Standardization window. Standardization. Standardization is the transformation of data as a result of which the mean of a variable is equal to 0 and its standard deviation is equal to 1: $z = \frac{x - \bar{x}}{sd}$. EXAMPLE 3.2 (file normalization.pqs). Make the transformations of all the variables included in the file: a) using the minimum-maximum normalization to the range from 0 to 10; b) using the logarithmic normalization; c) using the normalization with a coefficient; d) using standardization. 3.1.13 SIMILARITY MATRIX. The mutual relationships among objects can be expressed by their distances or, more generally, by the differences among them. The further from one another the objects are, the more they differ; the closer they are, the more they resemble one an
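Two of the transformations named above, minimum-maximum normalization to a chosen range and standardization, can be sketched as follows. This is a minimal illustration and not PQStat's implementation: the function names are hypothetical, the minimum-maximum formula is the usual linear rescaling, and the use of the sample standard deviation (divisor n − 1) is an assumption.

```python
import statistics

def min_max(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values so that they span [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("all values are equal; the range cannot be rescaled")
    scale = (new_max - new_min) / (hi - lo)
    return [new_min + (v - lo) * scale for v in values]

def standardize(values):
    """Transform values so that their mean is 0 and standard deviation is 1."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # assumption: sample standard deviation
    return [(v - mean) / sd for v in values]

# Hypothetical data, rescaled to the range 0-10 as in task a) of the example.
print(min_max([2, 4, 6], 0, 10))
print(standardize([2, 4, 6]))
```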
271. n Methods. Statistics in Medicine 17, 873-890. [66] Newman S.C. (2001), Biostatistical Methods in Epidemiology. 2nd ed. New York: John Wiley. [67] Peduzzi P., Concato J., Feinstein A.R., Holford T.R. (1995), Importance of events per independent variable in proportional hazards regression analysis. II. Accuracy and precision of regression estimates. Journal of Clinical Epidemiology 48, 1503-1510. [68] Plackett R.L. (1984), Discussion of Yates' "Tests of significance for 2x2 contingency tables". Journal of Royal Statistical Society Series A 147, 426-463. [69] Pratt J.W. and Gibbons J.D. (1981), Concepts of Nonparametric Theory. Springer-Verlag, New York. [70] Robins J., Breslow N. and Greenland S. (1986), Estimators of the Mantel-Haenszel variance consistent in both sparse data and large-strata limiting models. Biometrics 42, 311-323. [71] Robins J., Greenland S. and Breslow N.E. (1986), A general estimator for the variance of the Mantel-Haenszel odds ratio. American Journal of Epidemiology 124, 719-723. [72] Rothman K.J., Greenland S., Lash T.L. (2008), Modern Epidemiology. 3rd ed. Lippincott Williams and Wilkins, 221-225. [73] Satterthwaite F.E. (1946), An approximate distribution of estimates of variance components. Biometrics Bulletin 2, 110-114. [74] Savin N.E. and White K.J. (1977), The Durbin-Watson Test for Serial Correlation with Extreme Sample Sizes or Many Regressors. Econometrica 45, 1989-1996. [75] Spearman C. 1
272. n distinguished result with the size of m, and you know how often these results occur in the sample (we know a p proportion). Depending on the sample size n, you can choose the Z test for one proportion (for large samples) or the exact binomial test for one proportion (for small sample sizes). These tests are used to verify the hypothesis that the proportion in the population from which the sample is taken is a given value. Basic assumptions: measurement on a nominal scale (alternatively, an ordinal scale or an interval scale). The additional condition for the Z test for proportion: large frequencies (according to the interpretation of Marascuilo and McSweeney (1977) [60], each of these values: $np > 5$ and $n(1 - p) > 5$). Hypotheses: $H_0: p = p_0$, $H_1: p \neq p_0$, where p is the probability (distinguished proportion) in the population and $p_0$ is the expected probability (expected proportion). The Z test for one proportion. The test statistic is defined by $Z = \frac{p - p_0}{\sqrt{\frac{p_0(1 - p_0)}{n}}}$, where $p = \frac{m}{n}$ is the distinguished proportion for the sample taken from the population, m is the frequency of values distinguished in the sample, and n is the sample size. The test statistic with a continuity correction is defined by $Z = \frac{|p - p_0| - \frac{1}{2n}}{\sqrt{\frac{p_0(1 - p_0)}{n}}}$. The Z statistic, with and without a continuity correction, asymptotically (for large sizes) has the normal
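As a rough sketch of the Z test for one proportion described above, the following Python function computes the statistic with and without the continuity correction. The function name and the example numbers are hypothetical, and the two-sided p value is obtained from the standard normal distribution through `math.erfc`, which is an implementation choice made for this example.

```python
import math

def z_test_one_proportion(m, n, p0, continuity=False):
    """Z statistic and two-sided p value for H0: p = p0."""
    p = m / n                               # distinguished proportion in the sample
    se = math.sqrt(p0 * (1 - p0) / n)       # standard error under H0
    if continuity:
        z = (abs(p - p0) - 1 / (2 * n)) / se
    else:
        z = (p - p0) / se
    # Two-sided tail probability of the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical sample: 60 distinguished results out of 100, tested against p0 = 0.5.
print(z_test_one_proportion(60, 100, 0.5))
print(z_test_one_proportion(60, 100, 0.5, continuity=True))
```

The continuity correction shrinks the statistic towards zero, so the corrected p value is never smaller than the uncorrected one.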
273. n j-th measurement, j = 1, 2, ..., k. The test statistic asymptotically (for large sample size) has the normal distribution, and the p value is corrected for the number of possible simple comparisons c. The settings window with the Cochran Q ANOVA can be opened in Statistics menu → NonParametric tests (unordered categories) → Cochran Q ANOVA or in Wizard. [Screenshot: Cochran Q ANOVA settings window with variable selection, POST-HOC test options, data filter and report options.] Note: This test can be calculated only on the basis of raw data. EXAMPLE 12.3 (test.pqs file). We want to compare the difficulty of 3 test questions. To do this, we select a sample of 20 people from the analysed population. Every person from the sample answers 3 test questions. Next, we check the correctness of the answers (an answer can be correct or wrong). The table contains the following scores: question 1 answer question 2 answer question 3 answer Hypotheses 1 2 3 4 5 6 7 8 9 correct wrong correct wrong wrong wrong wrong wrong corr
274. n only from selected rows in the formula window. Operators: addition, subtraction, multiplication, division, modulo division (the result is the remainder of the division of one number by another), > greater, < lower, equal. Mathematical functions. Mathematical functions require numeric arguments. ln(v1) returns the natural logarithm of the given number; log10(v1) returns the logarithm to the base 10 of the given number; logn(v1) returns the logarithm to the base n of the given number; sqr(v1) returns the given number raised to the 2nd power; sqrt(v1) returns the square root of the given number; fact(v1) returns the factorial of the given number; degrad(v1) returns the angle in radians (the argument is in degrees); raddeg(v1) returns the angle in degrees (the argument is in radians); sin(v1) returns the sine of the given angle (the argument is in radians); cos(v1) returns the cosine of the given angle (the argument is in radians); tan(v1) returns the tangent of the given angle (the argument is in radians); ctng(v1) returns the cotangent of the given angle (the argument is in radians); arcsin(v1) returns the arc sine of the given value; arctan(v1) returns the arc tangent of the given value; exp(v1) returns e raised to the power of the given number; frac(v1) returns the fractional part of the given number; int(v1) returns the integer part of the given number
275. n the table below. [Table: numbers of new and steady viewers for each level of commitment, from rather small to very high; cell values not legible in the source.] The new viewers constitute 25% of all the analysed viewers. This proportion is not the same for each level of commitment, but looks like this (group of new viewers, group of steady viewers, total): rather small ($p_1$): 50.00%, 50.00%, 100%; average ($p_2$): 34.21%, 65.79%, 100%; $p_3$: 34.09%, 65.91%, 100%; $p_4$: 19.51%, 80.49%, 100%; very high ($p_5$): 18.98%, 81.02%, 100%; total: 25.00%, 75.00%, 100%. Hypotheses: $H_0$: in the population of the soap opera viewers, the trend in the proportions $p_1, p_2, p_3, p_4, p_5$ does not exist; $H_1$: in the population of the soap opera viewers, the trend in the proportions $p_1, p_2, p_3, p_4, p_5$ does exist. [Report: analysed variables: level of commitment, group; significance level 0.05; Chi-square statistic 12.3702523; degrees of freedom 1; p value 0.0004362.] [Graph: percentages of new and steady viewers by level of commitment (1 to 5) with a linear trend.] The p value = 0.000436, compared with the significance level α = 0.05, supports the alternative hypo
276. nd exam passing in the analysed population. Significantly, the exam was passed more often by women (55.56% of all the women in the sample passed the exam) than by men (25.00% of all the men in the sample passed the exam). The mid-p. The mid-p is a correction of the Fisher exact test. This modified p value is recommended by many statisticians (Lancaster 1961 [48], Anscombe 1981 [4], Pratt and Gibbons 1981 [69], Plackett 1984 [68], Miettinen 1985 [63], Barnard 1989 [6], Rothman 2008 [72]) as a method of decreasing the conservatism of the Fisher exact test. As a result, using the mid-p, the null hypothesis is rejected much more quickly than by using the Fisher exact test. For large samples, a p value calculated by using the $\chi^2$ test with the Yates' correction and the Fisher test give quite similar results, but a p value of the $\chi^2$ test without any correction corresponds with the mid-p. The p value of the mid-p is calculated by a transformation of the probability value for the Fisher exact test. The one-sided p value is calculated by using the following formula: $P_I(\text{mid-p}) = P_I(\text{Fisher}) - 0.5 \cdot P_{point}(\text{given table})$, where $P_I(\text{mid-p})$ is the one-sided p value of the mid-p and $P_I(\text{Fisher})$ is the one-sided p value of the Fisher exact test; the two-sided p value is defined as the doubled value of the smaller one-sided probability: $P_{II}(\text{mid-p}) = 2 P_I(\text{mid-p})$, where $P_{II}(\text{mid-p})$ is the two-sided p value of the mid-p. EXAMPLE 11.7 c
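The mid-p transformation described above can be illustrated with a short Python sketch for a 2x2 table with cells a, b (first row) and c, d (second row). The function names and the example table are hypothetical; the point probability is taken from the hypergeometric distribution with fixed margins, which is the standard basis of the Fisher exact test.

```python
from math import comb

def fisher_point(a, b, c, d):
    """Hypergeometric probability of one 2x2 table with fixed margins."""
    return comb(a + b, a) * comb(c + d, c) / comb(a + b + c + d, a + c)

def one_sided_fisher(a, b, c, d):
    """Smaller of the two one-sided Fisher p values for the observed table."""
    r1, r2, c1 = a + b, c + d, a + c
    lo, hi = max(0, c1 - r2), min(r1, c1)
    # Probability of every table with the same margins, indexed by its first cell.
    probs = {x: fisher_point(x, r1 - x, c1 - x, r2 - c1 + x) for x in range(lo, hi + 1)}
    left = sum(p for x, p in probs.items() if x <= a)
    right = sum(p for x, p in probs.items() if x >= a)
    return min(left, right)

def mid_p(a, b, c, d):
    """One-sided and two-sided mid-p values."""
    one_sided = one_sided_fisher(a, b, c, d) - 0.5 * fisher_point(a, b, c, d)
    return one_sided, 2 * one_sided

# Hypothetical table, not the data of the example in the guide.
print(mid_p(3, 1, 1, 3))
```

Subtracting half of the point probability is what makes the mid-p less conservative than the Fisher exact test for the same table.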
277. ndent variable Y for category 1 to the reference category (corrected with the remaining variables in the model). k > 2: If the analyzed variable has more than two categories, then the k categories are represented by k − 1 dummy variables with dummy coding. When creating variables with dummy coding, one selects a category for which no dummy variable is created. That category is treated as the reference category (as the value of each variable coded in the dummy coding is equal to 0 for it). When the variables $X_1, X_2, ..., X_{k-1}$ obtained in that way with dummy coding are placed in a regression model, their coefficients $b_1, b_2, ..., b_{k-1}$ will be calculated: $b_1$ is the reference of the Y results (for codes 1 in $X_1$) to the reference category (corrected with the remaining variables in the model); $b_2$ is the reference of the Y results (for codes 1 in $X_2$) to the reference category (corrected with the remaining variables in the model); ...; $b_{k-1}$ is the reference of the Y results (for codes 1 in $X_{k-1}$) to the reference category (corrected with the remaining variables in the model). Example: We code, in accordance with dummy coding, the sex variable with two categories (the male sex will be selected as the reference category) and the education variable with 4 categories (elementary education will be selected as the reference category). Coded education
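Dummy coding of a variable with k categories into k − 1 columns of zeros and ones, as described above, can be sketched in Python. The helper name is hypothetical, and the education labels other than "elementary" are invented for the illustration, since the coded table is not legible in the source.

```python
def dummy_code(values, reference):
    """Return one 0/1 column per category other than the reference category."""
    categories = sorted(set(values) - {reference})
    return {cat: [1 if v == cat else 0 for v in values] for cat in categories}

# Hypothetical education data with 4 categories; "elementary" is the reference.
education = ["elementary", "secondary", "tertiary", "vocational", "elementary"]
for name, column in dummy_code(education, "elementary").items():
    print(name, column)
```

Rows belonging to the reference category receive 0 in every column, which is why no separate dummy variable is needed for it.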
278. ndependent model; equality of variances of an analysed variable in all populations. Hypotheses: $H_0: \mu_1 = \mu_2 = ... = \mu_k$, $H_1$: not all $\mu_j$ are equal (j = 1, 2, ..., k), where $\mu_1, \mu_2, ..., \mu_k$ are the means of an analysed variable of each population. The test statistic is defined by $F = \frac{MS_{BG}}{MS_{WG}}$, where $MS_{BG} = \frac{SS_{BG}}{df_{BG}}$ is the mean square between groups, $MS_{WG} = \frac{SS_{WG}}{df_{WG}}$ is the mean square within groups, $SS_{BG} = \sum_{j=1}^{k} \frac{\left(\sum_{i=1}^{n_j} x_{ij}\right)^2}{n_j} - \frac{\left(\sum_{j=1}^{k}\sum_{i=1}^{n_j} x_{ij}\right)^2}{N}$ is the between-groups sum of squares, $SS_{WG} = SS_T - SS_{BG}$ is the within-groups sum of squares, $SS_T = \sum_{j=1}^{k}\sum_{i=1}^{n_j} x_{ij}^2 - \frac{\left(\sum_{j=1}^{k}\sum_{i=1}^{n_j} x_{ij}\right)^2}{N}$ is the total sum of squares, $df_{BG} = k - 1$ are the between-groups degrees of freedom, $df_{WG} = df_T - df_{BG}$ are the within-groups degrees of freedom, $df_T = N - 1$ are the total degrees of freedom, $N = \sum_{j=1}^{k} n_j$, $n_j$ are the sample sizes for j = 1, 2, ..., k, and $x_{ij}$ are the values of a variable taken from a sample for i = 1, 2, ..., $n_j$, j = 1, 2, ..., k. The F statistic has the F Snedecor distribution with $df_{BG}$ and $df_{WG}$ degrees of freedom. The p value, designated on the basis of the test statistic, is compared with the significance level α: if p ≤ α ⇒ reject $H_0$ and accept $H_1$; if p > α ⇒ there is no reason to reject $H_0$. An analysis of the variance enables you to get information only if there are any significant differences among populations
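The sums of squares and the F statistic defined above can be computed directly from the raw groups. The following Python sketch follows those computational formulas; the function name and the data are hypothetical, and the p value is left out because it requires the F distribution, which is not in the standard library.

```python
def one_way_anova(groups):
    """F statistic and degrees of freedom for a single-factor ANOVA."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_sum = sum(sum(g) for g in groups)
    correction = grand_sum ** 2 / n_total            # (sum of all x)^2 / N
    ss_total = sum(x * x for g in groups for x in g) - correction
    ss_bg = sum(sum(g) ** 2 / len(g) for g in groups) - correction
    ss_wg = ss_total - ss_bg                         # SS_WG = SS_T - SS_BG
    df_bg = k - 1
    df_wg = (n_total - 1) - df_bg                    # df_WG = df_T - df_BG
    f_stat = (ss_bg / df_bg) / (ss_wg / df_wg)
    return f_stat, df_bg, df_wg

# Hypothetical measurements in three independent groups.
print(one_way_anova([[1, 2, 3], [2, 3, 4], [5, 6, 7]]))
```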
279. ndicated by the result of the Hosmer-Lemeshow test (p = 0.1725). The whole model is statistically significant, which is indicated by the result of the Likelihood Ratio test (p < 0.000001). [Table of coefficients; for each variable: b, b error, 95% CI for b, Wald stat., p value, odds ratio, 95% CI for the odds ratio.] intercept: b = 7.230601, b error 1.870134, 95% CI (3.565206, 10.895997), Wald stat. 14.9465697, p = 0.00011, odds ratio 1361.0525, 95% CI (35.346726, 53959.9005); ADDRESSOFRES: b = −0.453242, b error 0.450524, 95% CI (−1.336253, 0.429769), Wald stat. 1.012102, p = 0.5144, odds ratio 0.635564, 95% CI (0.262829, 1.556902); SEX: b = −0.454788, b error 0.451304, 95% CI (−1.339327, 0.429751), Wald stat. 1.015501, p = 0.313589, odds ratio 0.634562, 95% CI (0.262022, 1.536675); AGE: b = −0.100896, b error 0.03159, 95% CI (−0.162812, −0.03898), Wald stat. 10.200921, p = 0.001404, odds ratio 0.904027, 95% CI (0.849751, 0.96177); EDUCATION: b = 0.455926, b error 0.241805, 95% CI (−0.018, 0.929857), Wald stat. 3.555199, p = 0.059359, odds ratio 1.577637, 95% CI (0.962161, 2.534146); TIME: b = −0.089395, b error 0.027609, 95% CI (−0.143507, −0.035282), Wald stat. 10.483921, p = 0.001204, odds ratio 0.914454, 95% CI (0.666314, 0.965333); DISTURBANCES: b = −1.924, b error 0.475056, 95% CI (−2.855092, −0.992908), Wald stat. 16.402912, p = 0.000051, odds ratio 0.146022, 95% CI (0.057551, 0.370496). The observed values and the predicted probability can be observed on the chart. [Chart: observed values against predicted probability.] In the model, the variables which have a significant influence on the result are: AGE (p = 0.0014), TIME (p = 0.0012), DISTURBANCES (p = 0.0001). What is more, the younger the person who solves the task t
280. EXAMPLE 14.1 (continuation, age-height.pqs file). Hypotheses: $H_0$: there is no monotonic dependence between age and height in the population of children attending the analysed school; $H_1$: there is a monotonic dependence between age and height in the population of children attending the analysed school. [Report and graph: analysed variables: age, height; significance level; size (number of pairs); tau; Z statistic for tau; asymptotic p value; scatter plot of height against age.] Comparing the p value = 0.000098 with the significance level α = 0.05, we draw the conclusion that there is a monotonic dependence between age and height in the population of children attending the analysed school. This dependence is directly proportional, which means that children grow taller as they get older. Kendall's correlation coefficient, i.e. the strength of the monotonic relation between age and height, amounts to τ = 0.7212. 14.2.4 CONTINGENCY TABLES COEFFICIENTS AND THEIR STATISTICAL SIGNIFICANCE. The contingency coefficients are calculated for the raw data or for the data gathered in a contingency table (see table 11.1). The Yule's Q
281. nfer if the independent variable for which the coefficient was estimated has a significant effect on the dependent variable. For that purpose we use the Wald test. Hypotheses: $H_0: \beta_i = 0$, $H_1: \beta_i \neq 0$, or equivalently $H_0: OR_i = 1$, $H_1: OR_i \neq 1$. The Wald test statistic is calculated according to the formula $\chi^2 = \left(\frac{b_i}{SE_{b_i}}\right)^2$. The statistic asymptotically (for large sizes) has the $\chi^2$ distribution with 1 degree of freedom. On the basis of the test statistic, the p value is estimated and then compared with the significance level α: if p ≤ α ⇒ we reject $H_0$ and accept $H_1$; if p > α ⇒ there is no reason to reject $H_0$. The quality of the constructed logistic regression model can be evaluated with the help of several measures. Pseudo $R^2$ is a goodness of fit measure of the model (an equivalent of the coefficient of multiple determination $R^2$ defined for multiple linear regression). The value of that coefficient falls within the range <0; 1), where values close to 1 mean an excellent goodness of fit of the model and 0 means a complete lack of fit. The coefficient $R^2_{Pseudo}$ is calculated according to the formula $R^2_{Pseudo} = 1 - \frac{\ln L_{FM}}{\ln L_0}$, where $L_{FM}$ is the maximum value of the likelihood function of a full model (with all variables) and $L_0$ is the maximum value of the likelihood function of a model which only contains an intercept. As coeffi
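The Wald test described above squares the ratio of a coefficient to its standard error and refers the result to the chi-square distribution with 1 degree of freedom. A minimal Python sketch, with a hypothetical function name and hypothetical input values, could look like this; the p value uses the identity between the chi-square distribution with 1 degree of freedom and the squared standard normal variable.

```python
import math

def wald_test(b, se):
    """Wald statistic (b / SE)^2 and its p value from chi-square with 1 df."""
    stat = (b / se) ** 2
    # For 1 degree of freedom: P(chi2 > stat) = P(|Z| > sqrt(stat)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value

# Hypothetical coefficient and standard error.
print(wald_test(0.5, 0.25))
```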
282. ng diagnosis) is the cost of assuming that the patient suffers from bacteremia although in reality he or she is not suffering from it (costs of a falsely positive decision). As the FN costs are much more serious than the FP costs, we enter a greater value in field one than in field two. We decided the value would be 5. The PCT value is to be used in screening, so we do not give the prevalence coefficient for the population (a priori prevalence coefficient), which is very low, but we use the coefficient estimated from the sample. We do so in order not to move the cut-off of the PCT value too high and not to increase the number of falsely negative results. [Settings and graph: costs of a wrong diagnosis (FN and FP) for the cut-off; ROC curve of sensitivity against 1 − specificity with the cut-off point marked.] The optimal PCT cut-off determined in this way is 1.819. For this point, sensitivity = 0.85 and specificity = 0.96. Another method of selecting the cut-off is the analysis of the costs graph and of the sensitivity and specificity intersection graph. [Graphs: costs of wrong decisions, and sensitivity and specificity, against the diagnostic variable.] The analysis of the costs graph shows that the minimum of the costs of wrong decisions lies at PCT = 1.819. The values of sensitivity and specificity are similar at PCT = 1.071. 16.2.2 ROC curves comparison
283. ning variables do not change. If the sum of the discounts made increases by 1 thousand dollars, then gross profit will increase by about 1.42 thousand dollars, assuming that the remaining variables do not change. If the book has been written by a known author (marked as 1), then in the model the author's popularity is assumed to be the value 1 and we get the equation: [equation for gross profit; coefficients not legible in the source]. If the book has been written by an unknown author (marked as 0), then in the model the author's popularity is assumed to be the value 0 and we get the equation: [equation for gross profit; coefficients not legible in the source]. The result of the t-test for each variable shows that only the production cost, advertising costs and author's popularity have a significant influence on the profit gained. At the same time, the standardized coefficients b are the greatest for those variables. Additionally, the model fits very well, which is confirmed by the small standard error of estimation $SE_e$ = 8.086501, the high value of the multiple determination coefficient $R^2$ = 0.850974, the corrected multiple determination coefficient $R^2_{adj}$ = 0.829059, and the result of the F test of variance analysis: p < 0.000001. On the basis of the interpretation of the results obtained so far, we can assume that a part of t
284. [Graphs: survival probability against time.] The report contains firstly an analysis of the strata: both the test results and the hazard ratio. In the first stratum, the growing trend of hazard is visible but not significant. In the second stratum, a trend with the same direction (a result bordering on statistical significance) is observed. A cumulation of those trends in a common analysis of strata allowed the obtainment of the significance of the trend of the survival curves. Thus, the older the patient at the time of a transplantation, the lower the probability of survival over a certain period of time, independently of the hospital in which the transplantation took place. A comparative analysis of the survival curves corrected by strata yields a result significant for the log-rank and Tarone-Ware tests and not significant for Gehan's test, which might mean that the differences among the curves are not so visible in the initial survival periods as in the later ones. By looking at the hazard ratio of the curves compared in pairs [Table: hazard ratio by strata, with −95% CI and +95% CI.] hospital 1, 1 <> 2: 0.3592782 (0.0775124, 1.5652928); hospital 1, 1 <> 3: 0.2701238 (0.0798335, 0.9139877); hospital 1, 2 <> 3: 0.7518511 (0.2305103, 2.4522984); hospital 2, 1 <> 2: 0.8226505 (0.3
285. nmodified Euclidean distance, the flats best suited to the client's conditions are no. 35 and no. 135. Having considered the weights, the flats best suited to the client's conditions will be no. 17 and no. 132, which are the first flats with the number of rooms = 3 and the distance to the district center similar to that requested by the client. The other 3 characteristics have a smaller influence on the result. 3.2 HOW TO WORK WITH REPORTS (RESULTS SHEETS). A report is a project element which enables you to store the results of an already performed statistical analysis. The report is added automatically to the project and ascribed to the active datasheet at the moment of finishing the current statistical procedure. Note that it cannot be edited, except for the graphs and the title. Editing a graph is started by double-clicking it or through the context menu (right mouse button). The title is edited in the Project Manager by adding or changing the description. The main operations on the report can be done via the context menu in the report window: Note (Title/Description); Print (Ctrl+P); Copy Report (Shift+Ctrl+C); Export Report to RTF; Export Report to PDF (Shift+Ctrl+P); Export Report to XML (Shift+Ctrl+X); Edit Graph; Copy Graph; Print Graph; Save Graph as. Printing: The options of printing are availa
286. As with survival tables, we calculate the survival function, i.e. the probability of survival until a certain time. The graph of the Kaplan-Meier survival function is a step function. The point of time at which the value of the function is 0.5 is the survival time median, that is, the observation time by which half of the observed patients have died and half of them are still alive. Both the median and other percentiles are determined as the shortest survival time for which the survival function is smaller than or equal to a given percentile. The survival time mean is determined as the area under the survival curve. The data concerning the survival time are usually very heavily skewed, so in the survival analysis the median is a better measure of the central tendency than the mean. EXAMPLE 19.1 (continued, file transplant.pqs). We present the survival time after a liver transplantation with the use of the Kaplan-Meier curve. [Graph: Kaplan-Meier survival function (survival probability against time) with failure events and censored observations marked.] [Report: analysed variables: time, status; censored variable: status (dead/alive); frequency and percent of failure events and censored observations; survival time: lower quartile, median, upper quartile; mean 10.954902442.] The survival function does not suddenly plunge right aft
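The step-function estimate and the percentile rule described above (the shortest time at which the survival function is smaller than or equal to the given level) can be sketched in Python. This is a simplified illustration with hypothetical function names and data, using the usual product-limit form of the Kaplan-Meier estimator; it is not the program's own code.

```python
def kaplan_meier(times, events):
    """Return [(time, survival)] at each distinct failure time.

    events: 1 = failure event, 0 = censored observation.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        ties = sum(1 for tt, _ in data if tt == t)          # observations at time t
        deaths = sum(1 for tt, e in data if tt == t and e == 1)
        if deaths:
            survival *= 1 - deaths / at_risk                 # product-limit step
            curve.append((t, survival))
        at_risk -= ties
        i += ties
    return curve

def survival_percentile(curve, level=0.5):
    """Shortest time at which the survival function is <= level (None if never)."""
    for t, s in curve:
        if s <= level:
            return t
    return None

# Hypothetical survival times; 0 marks a censored observation.
curve = kaplan_meier([1, 2, 3, 4, 5], [1, 0, 1, 1, 0])
print(curve, survival_percentile(curve))
```

With heavy censoring the curve may never reach the requested level, in which case the sketch returns `None` instead of a median.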
287. nsport company. Note: The assumptions for the single-factor analysis of variance are fulfilled: the age has the normal distribution in each of the analysed transport companies (the p values of the Lilliefors test are, respectively, p = 0.134516, p = 0.603209 and p = 0.607648); the Brown-Forsythe test indicates that there are no significant differences in the variances of the age of the transport companies' workers (p = 0.430173). 12.1.3 The Brown-Forsythe test and the Levene test. Both tests, the Levene test (Levene, 1960 [50]) and the Brown-Forsythe test (Brown and Forsythe, 1974 [16]), are used to verify the hypothesis determining the equality of variance of an analysed variable in several (k > 2) populations. Basic assumptions: measurement on an interval scale; normality of distribution of an analysed feature in each population; an independent model. Hypotheses: $H_0: \sigma_1^2 = \sigma_2^2 = ... = \sigma_k^2$, $H_1$: not all $\sigma_j^2$ are equal (j = 1, 2, ..., k), where $\sigma_1^2, \sigma_2^2, ..., \sigma_k^2$ are the variances of an analysed variable of each population. The analysis is based on calculating the absolute deviation of the measurement results from the mean (in the Levene test) or from the median (in the Brown-Forsythe test) in each of the analysed groups. These absolute deviations form the data set which is then subjected to the same procedure as the one performed in the analysis of variance for independe
288. nt groups. Hence, the test statistic is defined by $F = \frac{MS_{BG}}{MS_{WG}}$. The test statistic has the F Snedecor distribution with $df_{BG}$ and $df_{WG}$ degrees of freedom. The p value, designated on the basis of the test statistic, is compared with the significance level α: if p ≤ α ⇒ reject $H_0$ and accept $H_1$; if p > α ⇒ there is no reason to reject $H_0$. Note: The Brown-Forsythe test is less sensitive than the Levene test to an unfulfilled assumption of distribution normality. The settings window with the Levene, Brown-Forsythe tests can be opened in Statistics menu → Parametric tests → Levene, Brown-Forsythe. [Screenshot: Levene, Brown-Forsythe test settings window with variable and grouping variable selection, test options, data filter and report options.] 12.1.4 The ANOVA for dependent groups. The single-factor repeated-measures analysis of variance (ANOVA for dependent groups) is used when the measurements of an analysed variable are made several times (k > 2), each time in different conditions (but we need to assume that the variances of
289. ntiles $C_i$ (i = 1, 2, ..., 99) into 100 equal parts. The second quartile, the fifth decile and the fiftieth centile are equal to the median. These measures can be used on an interval or ordinal scale. 7.3 MEASURES OF VARIABILITY (DISPERSION). Knowledge of central tendency measures is not enough to fully describe the structure of a statistical data collection. The researched groups may have various levels of variation of the feature you want to analyse. You therefore need formulas which enable you to calculate the values of the variability of the features. Measures of variability are calculated only for an interval scale, because they are based on the distance between the points. Range is formulated as $I = \max x_i - \min x_i$, where $x_i$ are the values of the analysed variable. IQR (interquartile range): $IQR = Q_3 - Q_1$, where $Q_1$, $Q_3$ are the lower and the upper quartile. Ranges for a percentile scale (decile, centile): ranges between percentiles are one of the dispersion measures. They define the percentage of all observations which are located between the chosen percentiles. Variance measures the degree of spread of the measurements around the arithmetic mean. Sample variance: $sd^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n - 1}$, where $x_i$ are the following values of the variable, $\bar{x}$ is the arithmetic mean of these values and n is the sample size. Population variance: $\sigma^2 = \frac{\sum_{i=1}^{N}(x_i - \mu)^2}{N}$
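The dispersion measures listed above can be computed with Python's standard `statistics` module, as in the sketch below. The function name and the data are hypothetical, and the quartiles use the module's default method, which is an assumption: quartile conventions differ between programs, so the interquartile range may not match PQStat's result exactly.

```python
import statistics

def dispersion(values):
    """Range, interquartile range, sample variance and population variance."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # assumption: default quartile method
    return {
        "range": max(values) - min(values),
        "iqr": q3 - q1,
        "sample_variance": statistics.variance(values),       # divisor n - 1
        "population_variance": statistics.pvariance(values),  # divisor N
    }

# Hypothetical measurements.
print(dispersion([2, 4, 4, 4, 5, 5, 7, 9]))
```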
290. ntingency table. When the user chooses the kind of analysis, a graph appears. The graph is divided according to the scale on which the analysed features were measured: interval scale, ordinal scale, nominal scale. [Figure: Wizard graph with the t-test for independent groups selected. Hypotheses: $H_0: \mu_1 = \mu_2$, $H_1: \mu_1 \ne \mu_2$, where $\mu_1$, $\mu_2$ are the means of an analysed variable of the 1st and the 2nd population. Note: If the variances of the analysed variable in both populations are different, then instead of the t-Student test for independent groups the correction for this test, the so-called Cochran-Cox adjustment, is calculated. Questions shown: Are the data normally distributed? Are the data dependent? (YES/NO).] The user moves along the graph by selecting the appropriate answers to the questions asked. After following the chosen path through the graph, the user is able to perform the test which, according to the answers, is appropriate for the statistical problem. 22 OTHER NOTES 22.1 FILES FORMAT PQS: the default file format for PQStat files, used for representing all objects created with PQStat (project, datasheet, report, graph). PQX: an XML file for PQStat, used for representing all objects created with PQStat. PQX files are stored in Unicode text format
291. ntral tendency measures are so-called average or mean measures, which characterize the mean or typical level of a feature value.
Arithmetic mean is formulated: $\bar{x} = \frac{x_1 + x_2 + \ldots + x_n}{n} = \frac{\sum_{i=1}^{n} x_i}{n}$, where $x_i$ are following values of the variable and $n$ is the sample size. Arithmetic mean is used for an interval scale. If used for a sample it should be marked with $\bar{x}$, but for a population with $\mu$.
Geometric mean is formulated: $\bar{x}_G = \sqrt[n]{x_1 x_2 \cdots x_n} = \sqrt[n]{\prod_{i=1}^{n} x_i}$. This mean is used for an interval scale if the variable distribution is log-normal, so the variable logarithm has a normal distribution.
Harmonic mean is formulated: $\bar{x}_H = \frac{n}{\frac{1}{x_1} + \frac{1}{x_2} + \ldots + \frac{1}{x_n}} = \frac{n}{\sum_{i=1}^{n} \frac{1}{x_i}}$. This mean is used for an interval scale.
Median: In the ordered data set, the median is the value that divides this set into two equal parts. Half of all observations are below and half of them are above the median. [Figure: min, 50%, median, 50%, max.] Median can be used in both interval and ordinal scale.
Mode: Mode is the value that occurs most often among the results. Mode can be used in each measurement scale.
7.2.2 ANOTHER MEASURES OF POSITION Quartiles, deciles, centiles. [Figure: ordered data from min to max split into four 25% parts by the lower quartile $Q_1$ ($C_{25}$), the median $Q_2$ ($C_{50}$) and the upper quartile $Q_3$ ($C_{75}$).] Quartiles ($Q_1$, $Q_2$, $Q_3$) divide an ordered rank into 4 equal parts, deciles $D_i$ ($i = 1, 2, \ldots, 9$) divide an ordered rank into 10 equal parts, and centiles (perce
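As a quick check of the definitions above, the following sketch computes each central tendency measure with Python's `statistics` module. The data set is hypothetical and chosen so that the results are easy to verify by hand.

```python
import statistics

data = [2, 4, 4, 8]  # hypothetical positive measurements

arithmetic = statistics.mean(data)           # (2 + 4 + 4 + 8) / 4
geometric = statistics.geometric_mean(data)  # 4th root of 2 * 4 * 4 * 8
harmonic = statistics.harmonic_mean(data)    # 4 / (1/2 + 1/4 + 1/4 + 1/8)
median = statistics.median(data)
mode = statistics.mode(data)

print(arithmetic, geometric, harmonic, median, mode)
```

For positive data the three means are always ordered harmonic ≤ geometric ≤ arithmetic, which the example also shows.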
292. o be assumed by the researcher depends on the extent to which they represent the original variables, i.e. on the variance of the original variables that they explain. All principal components together explain 100% of the variance of the original variables. If the sum of the variances for a few initial components constitutes a large part of the total variance of the original variables, then the principal components can satisfactorily replace the original variables. It is assumed that over 80% of the variance should be reflected in the principal components. Kaiser criterion: According to the Kaiser criterion, the principal components we want to leave for interpretation should have at least the same variance as any standardized original variable. As the variance of every standardized original variable equals 1, according to the Kaiser criterion the important principal components are those whose eigenvalue exceeds or is near 1. Scree plot: The graph presents the pace of the decrease of eigenvalues, i.e. of the percentage of explained variance. [Figure: scree plot; X axis: numbers of eigenvalues, Y axis: eigenvalues; the end of the scree is marked.] The point on the chart at which the process stabilizes and the decreasing line changes into a horizontal one is the so-called end of the scree (the end of sprinkling of the information about the original values carried by the principal components). The components to the right of the point which ends the scree represent a very small variance and are for the m
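The two numeric criteria described here (share of explained variance and the Kaiser criterion) can be sketched as follows. The eigenvalues are hypothetical, and the `tolerance` parameter is only an assumed way of expressing "near 1"; the guide does not give a numeric threshold.

```python
def explained_variance(eigenvalues):
    """Return per-component and cumulative shares of the total variance."""
    total = sum(eigenvalues)
    shares = [ev / total for ev in eigenvalues]
    cumulative, running = [], 0.0
    for share in shares:
        running += share
        cumulative.append(running)
    return shares, cumulative

def kaiser_components(eigenvalues, tolerance=0.0):
    """1-based indices of components whose eigenvalue is at least 1 - tolerance."""
    return [i for i, ev in enumerate(eigenvalues, start=1) if ev >= 1.0 - tolerance]

# Hypothetical eigenvalues of a correlation matrix of 4 standardized variables.
eigenvalues = [2.9, 0.9, 0.15, 0.05]
shares, cumulative = explained_variance(eigenvalues)
print(cumulative)                            # cumulative share after each component
print(kaiser_components(eigenvalues))        # strict reading: eigenvalue >= 1
print(kaiser_components(eigenvalues, 0.15))  # lenient reading: "near 1"
```

In this example the strict Kaiser reading keeps one component, while the 80% rule and the lenient reading both keep two, which illustrates why the criteria are usually considered together.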
293. of the 3 curves above differs from the other curves. [Report: LogRank test (chi-square statistic, degrees of freedom, p value); grouping variable: age (nominal).]
Group | Obs | Exp | Obs/Exp
<45; 50) | 11 | 16.120092 | 0.6823782
<50; 55) | 20 | 21.490397 | 0.9306482
<55; 60) | 22 | 15.389510 | 1.4295451
Groups compared | Hazard ratio | -95% CI | +95% CI
<45; 50) vs <50; 55) | 0.7332289 | 0.3843909 | 1.3986403
<45; 50) vs <55; 60) | 0.4773394 | 0.2373928 | 0.9598135
<50; 55) vs <55; 60) | 0.6510100 | 0.3383317 | 1.2526581
[Figure: survival function; X axis: time, Y axis: survival probability.] On the basis of the significance level $\alpha = 0.05$ and the obtained values p = 0.0692 in the log-rank test (p = 0.09279 for Gehan's and p = 0.0779 for Tarone-Ware), we conclude that there is no basis for rejecting the hypothesis $H_0$. The length of life calculated for the patients in the three compared age groups is similar. However, it is noticeable that the values are quite near to the standard significance level 0.05. When examining the hazard values (the ratio of the observed values and the expected failure events) we notice that they are a little higher with each age group (0.68, 0.93, 1.43). Although no statistically significant differences among them are seen, it is possible that
294. Hypotheses: $H_0$: a lack of concordance between the 9 judges' assessments in the population represented by the sample; $H_1$: the 9 judges' assessments in the population represented by the sample are concordant. [Report: analysed variables: Dance couple A, Dance couple B, Dance couple C, Dance couple D, ...; significance level 0.05; degrees of freedom 5; Kendall's coefficient of concordance 0.83351; mean Spearman correlation coefficient 0.812698; Chi-square statistic (adjusted for ties) 37.507937; p value < 0.000001.] Comparing p < 0.000001 with the significance level $\alpha = 0.05$, we state that the judges' assessments are statistically concordant. The concordance strength is high: $W = 0.83351$, and similarly the average Spearman's rank-order correlation coefficient $r_s = 0.81270$. This result can be presented in a graph where the X axis represents the successive judges. The more intersections of the lines we see (the lines should be parallel to the X axis if the concordance is perfect), the lower the concordance of the judges' evaluations. [Figure: ranks plotted against judges.] 15.2.2 The Cohen's Kappa coefficient and the test of its significance. The Cohen's Kappa coefficient (Cohen J. (1960) [22]) defines the agreement level of two measurements of the same variable made in different conditions. Measurement of the same variable can be performed by 2 different observers (reproducibility) or by a
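The report values above are tied together by two standard identities for Kendall's W, which the following sketch checks: the chi-square statistic equals $m(n-1)W$ and the mean Spearman coefficient equals $(mW - 1)/(m - 1)$, for $m$ judges and $n$ ranked objects. The guide does not print these formulas in this fragment, so treat them as an assumption; $n = 6$ is inferred from the 5 degrees of freedom.

```python
def kendall_w_summary(w, judges, objects):
    """Derive chi-square, its degrees of freedom and mean Spearman r_s from Kendall's W."""
    chi_square = judges * (objects - 1) * w
    mean_spearman = (judges * w - 1) / (judges - 1)
    return chi_square, objects - 1, mean_spearman

# W and the number of judges come from the report; objects=6 is inferred from df = 5.
chi_square, df, mean_rs = kendall_w_summary(0.83351, judges=9, objects=6)
print(chi_square, df, mean_rs)
```

Both derived numbers agree with the reported 37.507937 and 0.812698 up to the rounding of W.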
295. om others in terms of age. To gain such knowledge, one of the POST-HOC tests must be used, for example the Tukey test. To do this, resume the analysis by clicking the resume button and then, in the options window for the test, select Tukey HSD and Add graph. [Report: Tukey HSD comparisons of transport company 1, transport company 2 and transport company 3. p values: company 1 vs company 2: 0.06965; company 1 vs company 3: 0.00403; company 2 vs company 3: 0.49229.] [Figure: mean age with 95% confidence interval and standard deviation for transport company 1, transport company 2 and transport company 3.] The critical difference (CD) calculated for each pair of comparisons is the same (because the group sizes are equal) and amounts to 2.730855. The comparison of the CD value with the value of the mean difference indicates that there are significant differences only between the mean age of the workers from the first and the third transport company (only when these 2 groups are compared is the CD value less than the difference of the means). You draw the same conclusion if you compare the p value of the POST-HOC test with the significance level $\alpha = 0.05$. The workers of the first transport company are about 3 years younger (on average) than the workers of the third tra
296. omenon does not occur. The Z statistic asymptotically (for large sample sizes) has the normal distribution. The p value designated on the basis of the test statistic is compared with the significance level $\alpha$: if $p \le \alpha$, reject $H_0$ and accept $H_1$; if $p > \alpha$, there is no reason to reject $H_0$. 16.2.1 Selection of optimum cut-off. The point which is looked for is a certain value of the diagnostic variable which provides the optimum separation of the studied population into two groups: one in which the given phenomenon occurs and one in which it does not occur. The selection of the optimum cut-off is not easy because it requires specialist knowledge about the topic of the study. For example, different cut-offs will be required, on the one hand, in a test used for screening of a large group of people, e.g. for a mammography study, and, on the other hand, in invasive studies conducted for the purpose of confirming an earlier suspicion, e.g. in histopathology. With the help of an advanced mathematical apparatus we can find a cut-off which will be the most useful from the perspective of mathematics. PQStat enables the selection of an optimum cut-off by means of an analysis of the graph of the intersection of sensitivity and specificity. Besides, the optimum cut-off can be computed on the basis of data about the costs of wrong dec
297. on the basis of the test statistic is compared with the significance level $\alpha$: if $p \le \alpha$, reject $H_0$ and accept $H_1$; if $p > \alpha$, there is no reason to reject $H_0$. Note: In the interpretation of the relative risk significance we usually use the designated confidence interval. Then we check if the interval contains the value of 1. The relative risk, together with the asymptotic confidence intervals and the relative risk significance test, is calculated by:
• the Chi-square test, OR/RR (2x2) window,
• the Mantel-Haenszel OR/RR window, for each table designated by the strata.
11.2.8 The Z test for 2 independent proportions. The Z test for 2 independent proportions is used in similar situations as the chi-square test (2 x 2), that is, when there are 2 independent samples with the total sizes $n_1$ and $n_2$ and with 2 possible results to gain (one of the results is distinguished, with the size $m_1$ in the first sample and $m_2$ in the second one). For these samples it is also possible to calculate the distinguished proportions $p_1 = m_1/n_1$ and $p_2 = m_2/n_2$. This test is used to verify the hypothesis that the distinguished proportions $P_1$ and $P_2$ in the populations from which the samples were drawn are equal. Basic assumptions: measurement on a nominal scale (alternatively an ordinal or an interval scale), an independent model, lar
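A minimal sketch of a Z test for two independent proportions follows. It uses the common pooled-proportion form of the statistic, which is an assumption here because the guide's own formula is not part of this fragment; the counts are hypothetical.

```python
import math

def z_test_two_proportions(m1, n1, m2, n2):
    """Pooled Z statistic and two-sided asymptotic p value for H0: P1 = P2."""
    p1, p2 = m1 / n1, m2 / n2
    pooled = (m1 + m2) / (n1 + n2)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / std_err
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail
    return z, p_value

# Hypothetical samples: 30 of 100 versus 15 of 100 distinguished results.
z, p_value = z_test_two_proportions(30, 100, 15, 100)
print(z, p_value)
```

With these counts the p value falls below 0.05, so at that significance level the hypothesis of equal proportions would be rejected.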
298. [Report: analysed variables: sound level meter I, sound level meter II, sound level meter III. Leading values: 138.222222, 130.222222, 6.444444. Between conditions degrees of freedom (df BC): 2; between subjects degrees of freedom (df BS): 11; residual degrees of freedom (df RES): 22; total degrees of freedom (df T): 35; mean square between conditions (MS BC): 0.777778; mean square between subjects (MS BS): 11.838384; mean square residual (MS RES): 0.292929; intraclass correlation coefficient $r_{ICC}$: 0.92029; average measure $r_{ICC}$: 0.971939; F statistic: 40.413793; p value < 0.000001.] [Figure: values measured by sound level meter I, sound level meter II and sound level meter III.] Comparing p < 0.000001 with the significance level $\alpha = 0.05$, we state that the sound intensity levels measured by the three different meters are absolutely concordant in the analysed population. The strength of absolute concordance is high: $r_{ICC} = 0.92029$. The concordance of the results is also visible in the Bland-Altman plots 3 10, where almost all of the values fall into the specified range. [Figure: Bland-Altman plot for sound level meter I and sound level meter II; X axis: mean of sound level meter I and sound level meter II.] [Figure: Bland-Altman plot.]
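The intraclass correlation values in the report can be reproduced from the mean squares with the usual two-way formulas for absolute agreement. These formulas are an assumption (they are not printed in this fragment), as are the inferred sizes: 12 subjects from df BS = 11 and 3 conditions from df BC = 2.

```python
def icc_absolute(ms_bs, ms_bc, ms_res, subjects, conditions):
    """Single-measure and average-measure ICC for absolute agreement."""
    numerator = ms_bs - ms_res
    single = numerator / (
        ms_bs + (conditions - 1) * ms_res + conditions * (ms_bc - ms_res) / subjects
    )
    average = numerator / (ms_bs + (ms_bc - ms_res) / subjects)
    return single, average

# Mean squares taken from the report; subjects and conditions inferred from the df values.
single, average = icc_absolute(11.838384, 0.777778, 0.292929, subjects=12, conditions=3)
f_statistic = 11.838384 / 0.292929  # MS BS / MS RES
print(single, average, f_statistic)
```

The results match the reported 0.92029, 0.971939 and F = 40.413793 up to rounding, which supports the reading of the garbled report values.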
299. ont. (sex-exam.pqs file) The two-sided p value of the contingency table from example 11.7 is p = 0.000054. So, at the significance level $\alpha = 0.05$ (similarly to the Fisher exact test, the $\chi^2$ test and the $\chi^2$ test with the Yates correction), you accept the alternative hypothesis, which states that there is a dependence between sex and passing the exam in the analysed population. The exam was passed significantly more often by women (50.06% of all the women in the sample passed the exam) than by men (25.00% of all the men in the sample passed the exam). 11.2.7 Relative Risk and Odds Ratio. The risk and odds of occurrence of an analysed phenomenon, on the basis of exposure to a factor that can cause it, are estimated from data collected in a 2 x 2 contingency table.
Table 11.4. The contingency table of 2 x 2 observed frequencies
Observed frequencies | Analysed phenomenon (illness): occurs (case) | not occurs (control) | Total
Risk factor: exposed | $O_{11}$ | $O_{12}$ | $O_{11} + O_{12}$
Risk factor: unexposed | $O_{21}$ | $O_{22}$ | $O_{21} + O_{22}$
Total | $O_{11} + O_{21}$ | $O_{12} + O_{22}$ | $n = O_{11} + O_{12} + O_{21} + O_{22}$
If a study is a case-control study, the odds ratio of occurrence of the phenomenon is calculated for the table. Usually these are retrospective studies: the researcher decides on his own about the size of the sample with the phenomenon and of the control sample without the phenomenon
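Using the cell names of Table 11.4, relative risk and odds ratio can be sketched as below. The formulas are the standard textbook ones and the counts are hypothetical; the guide's own formulas fall outside this fragment.

```python
def relative_risk(o11, o12, o21, o22):
    """Risk among the exposed divided by risk among the unexposed."""
    return (o11 / (o11 + o12)) / (o21 / (o21 + o22))

def odds_ratio(o11, o12, o21, o22):
    """Odds of the phenomenon among the exposed divided by odds among the unexposed."""
    return (o11 * o22) / (o12 * o21)

# Hypothetical table: exposed 20 cases / 80 controls, unexposed 10 cases / 90 controls.
print(relative_risk(20, 80, 10, 90))
print(odds_ratio(20, 80, 10, 90))
```

A value of 1 for either measure means no association, which is why the confidence interval is checked for containing 1.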
300. ore variables allowing us to predict the value of the gross profit. They will be production costs (variable no. 3), advertising costs (variable no. 4), and author's popularity (variable no. 7). The results now seem to be less distant from the real values. However, there is no result for position no. 35, because there was no information about the production costs of that book, that is, about the factor on which we wanted to base our prediction. 3.1.12 NORMALIZATION/STANDARDIZATION The normalization/standardization window is accessed via Data → Normalization/Standardization. [Screenshot: Normalization/Standardization window with input variable, output variable, data selection and type options.] The normalization of data is scaling them to a range, e.g. to a range of (-1, 1) or (0, 1). Min/max normalization: The min/max normalization, with the use of a linear function, scales data to a range ($new_{min}$, $new_{max}$) indicated by the user. For that purpose we should know the range which the data can reach. If we do not know the range, we can use the greatest and the smallest values in the analyzed set (in such a case we select the calculate from sample option in the Normalization/Standardization window): $x' = \frac{x - min}{max - min} (new_{max} - new_{min}) + new_{min}$. Logarithmic normalization: Normalization with the use of the logarithmic function (S-shaped) reduces the data to the range of (0, 1) e
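The min/max formula can be sketched as a small function. The names `old_min` and `old_max` are hypothetical; leaving them unset mirrors the "calculate from sample" option by taking the smallest and greatest values of the data.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0, old_min=None, old_max=None):
    """Linearly rescale values from (old_min, old_max) to (new_min, new_max)."""
    lo = min(values) if old_min is None else old_min  # calculate from sample
    hi = max(values) if old_max is None else old_max
    scale = (new_max - new_min) / (hi - lo)
    return [(x - lo) * scale + new_min for x in values]

rooms = [2, 4, 6]  # hypothetical numbers of rooms
print(min_max_normalize(rooms))                              # scaled to (0, 1)
print(min_max_normalize(rooms, new_min=-1.0, new_max=1.0))   # scaled to (-1, 1)
```

The sketch assumes the data are not all equal; otherwise the denominator would be zero.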
301. ories ($k > 2$) ought to be divided into dummy variables with two categories and coded. $k = 2$: If a variable is dichotomous, it is the decision of the researcher how the data representing the variable will be entered, so any numerical codes can be entered, e.g. 0 and 1. In the program one can change one's coding into effect coding by selecting that option in the window of the selected multidimensional analysis. Such coding replaces the smaller value with -1 and the greater value with 1. $k > 2$: If a variable has many categories, then in the window of the selected multidimensional analysis we select the button Dummy variables and set the reference (base) category for those variables which we want to break into dummy variables. The variables will be dummy coded unless the effect coding option is selected in the window of the analysis; in such a case they will be coded as -1, 0 and 1. Dummy coding is employed in order to answer, with the use of multidimensional models, the question: How do the (Y) results in any analyzed category differ from the results of the reference category? The coding consists in ascribing value 0 or 1 to each category of the given variable. The category coded as 0 is then the reference category. $k = 2$: If the coded variable is dichotomous, then by placing it in a regression model we will obtain the coefficient calculated for it (b). The coefficient is the reference of the value of the depe
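The difference between dummy coding and effect coding can be sketched as follows. The function name, the alphabetical column order and the example categories are hypothetical; only the coding rule (0/1 with an all-zero reference row, or -1 for the reference category under effect coding) follows the text.

```python
def code_variable(values, reference, effect=False):
    """Return the non-reference categories and one coded row per observation."""
    categories = sorted(set(values) - {reference})
    rows = []
    for value in values:
        if value == reference:
            # Reference category: all 0 in dummy coding, all -1 in effect coding.
            rows.append([-1 if effect else 0] * len(categories))
        else:
            rows.append([1 if value == category else 0 for category in categories])
    return categories, rows

education = ["primary", "secondary", "higher"]  # hypothetical variable with k = 3
print(code_variable(education, reference="primary"))
print(code_variable(education, reference="primary", effect=True))
```

A variable with k categories always produces k - 1 coded columns, because the reference category is expressed through the others.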
302. orrectly defined, so they have obtained the negative result of mammography.
• 4% of the examined women suffer from breast cancer.
• A woman who has obtained a positive mammography result can be 47.37% sure that she suffers from breast cancer.
• A woman who has obtained a negative test result can be 99.57% sure that she does not suffer from breast cancer.
• The probability that the positive mammography result will be obtained by a woman genuinely suffering from cancer is 21.60 times greater than the probability that the positive mammography result will be obtained by a healthy woman (not suffering from breast cancer).
• The probability that the negative mammography result will be obtained by a woman genuinely suffering from breast cancer is 10.43% of the probability that the negative mammography result will be obtained by a healthy woman (not suffering from breast cancer).
• A woman undergoing mammography (regardless of age) can be 96.50% sure of the definitive diagnosis.
16.2 ROC CURVE The diagnostic test is used for differentiating objects with a given feature (marked as (+), e.g. ill people) from objects without the feature (marked as (-), e.g. healthy people). For the diagnostic test to be considered valuable, it should yield a relatively small number of wrong classifications. If the test is based on a dichotomous variable, then the proper tool for the evaluation of the quality of the test is the analysis of a 2 x 2 contingency t
303. orrelation coefficient in a population, $r_s$ is the Spearman's rank correlation coefficient in a sample. The value of $r_s \in \langle -1; 1 \rangle$ and it should be interpreted the following way:
• $r_s \approx 1$ means a strong positive monotonic correlation (increasing): when the independent variable increases, the dependent variable increases too;
• $r_s \approx -1$ means a strong negative monotonic correlation (decreasing): when the independent variable increases, the dependent variable decreases;
• if the Spearman's correlation coefficient is equal or very close to zero, there is no monotonic dependence between the analysed features (but there might exist another relation, a non-monotonic one, for example a sinusoidal relation).
The Kendall's $\tilde{\tau}$ correlation coefficient (Kendall (1938) [42]) is used to describe the strength of monotonic relations between features. It may be calculated on an ordinal scale or an interval one. The value of the Kendall's $\tilde{\tau}$ correlation coefficient should be calculated using the following formula: $\tilde{\tau} = \frac{2 (n_C - n_D)}{\sqrt{n(n-1) - T_X} \sqrt{n(n-1) - T_Y}}$, where: $n_C$ is the number of pairs of observations for which the values of the ranks for the X feature as well as the Y feature are changed in the same direction (the number of agreed pairs), $n_D$ is the number of pairs of observations for which the values of the ranks for the X fe
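The Kendall formula can be sketched by counting agreed and disagreed pairs directly. The tie corrections $T_X$ and $T_Y$ are not defined in this fragment, so the sketch assumes the usual form $T = \sum (t^2 - t)$, with $t$ the size of each group of tied values; the data are hypothetical.

```python
import math
from collections import Counter

def kendall_tau(x, y):
    """Kendall's tau with a tie correction (the definition of T is an assumption)."""
    n = len(x)
    n_c = n_d = 0
    for i in range(n):
        for j in range(i + 1, n):
            direction = (x[i] - x[j]) * (y[i] - y[j])
            if direction > 0:
                n_c += 1  # pair changes in the same direction
            elif direction < 0:
                n_d += 1  # pair changes in opposite directions
    t_x = sum(t * t - t for t in Counter(x).values())
    t_y = sum(t * t - t for t in Counter(y).values())
    return 2 * (n_c - n_d) / math.sqrt((n * (n - 1) - t_x) * (n * (n - 1) - t_y))

print(kendall_tau([1, 2, 3, 4], [1, 3, 2, 4]))  # one disagreed pair out of six
```

The double loop is quadratic in the sample size, which is fine for a sketch but slow for large data sets.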
304. ost part random noise. 18.1.4 Defining principal components. When we have decided how many principal components we need, we can start generating them. In the case of principal components created on the basis of a correlation matrix, they are computed as a linear combination of standardized original values. If, however, principal components have been created on the basis of a covariance matrix, they are computed as a linear combination of original values which have been centered with respect to their mean. The obtained principal components constitute new variables with certain advantages. First of all, the variables are not collinear. Usually there are fewer of them than original variables, sometimes much fewer, and they carry the same or a slightly smaller amount of information than the original values. Thus, the variables can easily be used in most multidimensional analyses. 18.1.5 The advisability of using the Principal component analysis. If the variables are not correlated (the Pearson's correlation coefficient is near 0), then there is no use in conducting a principal component analysis, as in such a situation every variable is already a separate component. Bartlett's test: The test is used to verify the hypothesis that the correlation coefficients between variables are zero (i.e. the correlation matrix is an identity matri
305. ot negation operator, used in conditional sentences (if). 3.1.10 HOW TO GENERATE DATA There are 2 methods of data generation:
1. The first method uses a pull technique. The data are pulled from the selected cells into the neighbouring ones with the mouse pointer. This method enables you to generate exactly the same values (numeric or text) in the neighbouring columns or rows. To start data generation, select a cell with the proper content, then click on its lower right corner with the mouse pointer and, without releasing the button, pull through all the cells you want to fill. One cell can be pulled in any direction (up, down, right, left). It is also possible to pull various values placed in one column (left or right) or in one row (up or down).
2. The other method enables you to generate numerical data in columns as a data sequence, as random values, or as random values from a chosen distribution. To generate numerical data, select the cell where you want to start filling the datasheet and open the data generation window in Data menu → Generate. [Screenshot: Data generator window.] We indicate the variable in which the generated data will be placed. In the middle part of the window, depending on the data generation method chosen above, set:
• To generate data series: Start value, the first value which needs to be generated
306. other. One can study the distance of the objects with respect to many features, e.g. when the compared objects are cities, we can define their similarity on the basis of, among other things, the length of the road which joins them, population density, GDP, pollution emissions, average property prices, etc. With so many characteristics at the researcher's disposal, he or she must select the measure of distance which will best represent the real similarity of objects. The window with the settings for the similarity matrix option is accessed from the menu Data → Similarity matrix. [Screenshot: Similarity matrix window. Options include the source (Map, Sheet), the weights matrix (according to distance, according to contiguity), row standardization, and the metrics Euclidean, Mahalanobis, Minkowski, CityBlock, Chebyshev, Cosine, Bray-Curtis, Jaccard, Tanimoto.] The differences or similarities of the objects are expressed with the use of distance, usually in the form of a metric. However, not every measure of distance is a metric. For a distance to be called a metric, it has to fulfill 4 conditions: 1. the distance between the objects cannot be a negative number: $d(x_1, x_2) \ge$
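Three of the distance measures named in the Similarity matrix window can be sketched with their textbook definitions. The feature vectors are hypothetical, and the sketch does not claim to reproduce PQStat's implementation or its optional modifications.

```python
import math

def euclidean(a, b):
    """Square root of the sum of squared coordinate differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def city_block(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

def chebyshev(a, b):
    """Largest absolute coordinate difference."""
    return max(abs(x - y) for x, y in zip(a, b))

flat_wanted = (0.0, 0.0)  # hypothetical normalized features
flat_offer = (3.0, 4.0)
print(euclidean(flat_wanted, flat_offer),
      city_block(flat_wanted, flat_offer),
      chebyshev(flat_wanted, flat_offer))
```

All three are metrics: they are non-negative, zero only for identical objects, symmetric, and satisfy the triangle inequality.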
307. owing variables, i.e. Floor on which the flat is located, Age of the building, Distance of the district center, and Proximity of a bus or tram stop. [Screenshot: Normalization/Standardization window; input variable: No. of the rooms; output variable: Norm No. of the rooms; data reduced by the selection.] The normalized data are presented in the table below. Rows: Wanted, Flat 10, Flat 12, Flat 17, Flat 35, Flat 88, Flat 101, Flat 105, Flat 122, Flat 130, Flat 132, Flat 135.
Norm Number of rooms: 0.666666667, 0.333333333, 0, 0.666666667, 0.333333333, 0.666666667, 1, 0.333333333, 0, 0.333333333, 0.666666667, 0.666666667
Norm Floor on which the flat is located, Norm Age of the building: 0.222222222, 0, 0, 0.666666667, 0.555555556, 0.555555556, 1, 0.555555556, 0.555555556, 1, 0.555555556, 0.555555556
Norm Distance of the district center: 0.666666667, 0.166666667, 0.166666667, 0, 0, 0.166666667, 0, 1, 0.166666667
Norm Proximity of a bus or tram stop: 0.142857143, 0.285714286, 0.387755102, 1, 0.183673469, 0.387755102, 0, 0.081632653, 0.183673469, 0.020408163, 0.795918367, 0.183673469
On the basis of the normalized data we will select the flats which are the most suited to the client's inquiry. We will use the Euclidean distance metric to calculate the similarity. The smaller the obtained value, the more similar the properties. The analysis can be
308. p of Lilliefors test is p = 0.016415. [Report: normality of residuals: d statistic 0.155181, degrees of freedom 40, p value 0.016415.] When we take a closer look at the outlier (position 16) in the data for the task, we see that the book is the only one for which the costs are higher than the gross profit (gross profit = 4 thousand dollars, the sum of costs = 8.6 + 0.33 + 1 + 6 = 15.93 thousand dollars). The obtained model can be corrected by removing the outlier. For that purpose another analysis has to be conducted, with a filter switched on which will exclude the outlier. [Screenshot: Data Filter with variable, condition and value fields.] As a result, we receive a model which is very similar to the previous one but is encumbered with a smaller error and is more adequate. [Report: analysed variables, significance level, data filter, number of estimated parameters, R, R2, adjusted R2, standard error of estimation, residual sum of squares, total sum of squares, explained sum of squares, F, p value.]
Coefficients for intercept, prod_c, advert_c, prom_c, rebates, popular_author:
b: 6.892626, 2.676933, 2.060912, 1.920214, 1.525426, 7.302624
b error: 2.691394, 0.301989, 0.216204, 2.903129, 0.844957, 1.710653
-95% CI: 1.010044, 2.064531, 1.641042, 3.986247, 0.393651, 3.902274
+95% CI: 12.775213, 3.293335, 2.020761, 7.026675, 3.044503, 10.862973
t st
309. plete cases in a given range.
• cumulative survival proportion (survival function): the probability of surviving over a given period of time. Because to survive another period of time one must have survived all the previous ones, the probability is calculated as the product of all the previous proportions of the survival cases; ± standard error of the survival function;
• probability density: the calculated probability of experiencing the event (death) in a given range, calculated per a period of time; ± standard error of the probability density;
• hazard rate: the probability, calculated per a unit of time, that a patient who has survived until the beginning of a given range will experience the event (die) in that range; ± standard error of the hazard rate.
Note: In the case of a lack of complete observations in any range of survival time, there is the possibility of using a correction. The zero number of complete cases is then replaced with the value 0.5.
Graphic interpretation: We can illustrate the information obtained thanks to the life tables with the use of several charts:
• a survival function graph,
• a probability density graph,
• a hazard rate graph.
EXAMPLE 19.1 (file transplant.pqs) Patients' survival rate after the transplantation of a liver was studied. 89 patients were observed over 21 years. The age of a
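The product rule for the cumulative survival proportion can be sketched as a running product of the per-range survival proportions. The proportions are hypothetical, and the sketch leaves aside how a life table aligns each product with the start or end of a range.

```python
from itertools import accumulate
import operator

def cumulative_survival(range_survival):
    """Running product of the survival proportions of consecutive time ranges."""
    return list(accumulate(range_survival, operator.mul))

# Hypothetical proportions of patients surviving each of three consecutive ranges.
print(cumulative_survival([0.9, 0.8, 0.5]))
```

Because every factor is at most 1, the resulting survival function can never increase over time.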
310. qs file) You want to analyse the compatibility of a diagnosis made by 2 doctors. To do this, you need to draw 110 patients (children) from a population. The doctors treat patients in neighbouring doctors' offices. Each patient is examined first by doctor A and then by doctor B. Both diagnoses made by the doctors are shown in the table below.
 | pneumonia | bronchitis | others
pneumonia | 31 | 4 | 4
bronchitis | 8 | 39 | 9
others | 5 | 7 | 3
Hypotheses: $H_0: \kappa = 0$, $H_1: \kappa \ne 0$. We could analyse the agreement of the diagnoses using just the percentage of the compatible values. In this example, the compatible diagnoses were made for 73 patients (31 + 39 + 3 = 73), which is 66.36% of the analysed group. The kappa coefficient introduces the correction of a chance agreement (it takes into account the agreement occurring by chance). [Report: Kappa coefficient 0.4458061; std. err. of Kappa 0.068029; -95% CI for Kappa coefficient 0.312471; +95% CI for Kappa coefficient 0.5791405; std. err. of Kappa distribution 0.0723246; Z statistic 6.165942; p value (asymptotic) 0.0000001.]
Percentages of all pairs:
 | bronchitis | others | pneumonia
bronchitis | 35.45% | 8.18% | 7.27%
others | 6.36% | 2.73% | 4.55%
pneumonia | 3.64% | 3.64% | 28.18%
[Figure: observed frequencies and expected frequencies.]
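The kappa coefficient from this example can be recomputed with a short sketch. The counts below are a reconstruction assumed from the diagonal (31 + 39 + 3 = 73), the sample size of 110 and the report percentages, not a verbatim copy of the garbled table; they reproduce the reported coefficient.

```python
def cohen_kappa(table):
    """Cohen's kappa for a square table of paired classifications."""
    size = len(table)
    n = sum(sum(row) for row in table)
    observed = sum(table[i][i] for i in range(size)) / n
    row_totals = [sum(row) for row in table]
    col_totals = [sum(table[i][j] for i in range(size)) for j in range(size)]
    # Agreement expected by chance, from the marginal totals.
    expected = sum(row_totals[i] * col_totals[i] for i in range(size)) / n ** 2
    return (observed - expected) / (1 - expected)

# Order of categories: pneumonia, bronchitis, others (reconstructed counts).
diagnoses = [[31, 4, 4],
             [8, 39, 9],
             [5, 7, 3]]
kappa = cohen_kappa(diagnoses)
print(round(kappa, 7))  # about 0.4458, the value shown in the report
```

The observed agreement is 66.36%, but roughly 39% agreement would be expected by chance alone, which is why kappa is much lower than the raw percentage.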
311. r. From that we may infer that the first two principal components carry important information. Together they explain a great part, as much as 95.81%, of the variance (see the cumulative % column). The communalities for the first principal component are high for all original variables except the variable of the width of the sepal, for which they equal 21.17%. That means that if we only interpret the first principal component, only a small part of the variable of the width of the sepal would be reflected.
Communalities (%):
Sepal Len | 79.24004 | 92.25986 | 99.85857
Sepal Wid | 21.17313 | 99.09193 | 99.9684
Petal Len | 98.31816 | 98.37299 | 98.66944
Petal Wid | 93.11843 | 93.52803 | 99.43209
For the first two principal components the communalities are at a similar, very high level and they exceed 90% for each of the analyzed variables, which means that with the use of those components the variance of each variable is represented in over 90%. In the light of all that knowledge it has been decided to separate and interpret 2 components. In order to take a closer look at the relationship of principal components and original variables, that is the length and the width of the petals and sepals, we interpret eigenvectors, factor loadings, and contributions of original variables.
Sepal Len | 0.52106 | 0.37741 | 0.719566 | 0.261286
Sepal Wid | 0.269347 | 0.9232 | 0.24438 | 0.12351
312. r R.A. (1934), Statistical methods for research workers (5th ed.). Edinburgh: Oliver and Boyd.
[28] Fisher R.A. (1935), The logic of inductive inference. Journal of the Royal Statistical Society, Series A, 98, 39-54
[29] Fisher R.A. (1936), The use of multiple measurements in taxonomic problems. Annals of Eugenics 7 (2), 179-188
[30] Fleiss J.L. (1981), Statistical methods for rates and proportions (2nd ed.). New York: John Wiley, 38-46
[31] Freeman G.H. and Halton J.H. (1951), Note on an exact treatment of contingency, goodness of fit and other problems of significance. Biometrika 38, 141-149
[32] Freireich E.O., Gehan E., Frei E., Schroeder L.R., Wolman I.J., et al. (1963), The effect of 6-mercaptopurine on the duration of steroid induced remission in acute leukemia. Blood 21, 699-716
[33] Friedman M. (1937), The use of ranks to avoid the assumption of normality implicit in the analysis of variance. Journal of the American Statistical Association 32, 675-701
[34] Gehan E.A. (1965a), A Generalized Wilcoxon Test for Comparing Arbitrarily Singly-Censored Samples. Biometrika 52, 203-223
[35] Gehan E.A. (1965b), A Generalized Two-Sample Wilcoxon Test for Doubly-Censored Data. Biometrika 52, 650-653
[36] Guttman L. (1945), A basis for analyzing test-retest reliability. Psychometrika 10, 255-282
[37] Hanley J.A. (1987), Standard error of the Kappa statistic. Psychological Bulletin, Vol. 102, No. 2, 315-321
[38] Hanley J.A. and Ha
313. rametric tests (unordered categories) → Fisher, RxC or in Wizard. [Screenshot: Fisher exact test (RxC) settings window with report options: Add analysed data, Add graph, Add percentages.] Info: The process of calculation of p values for this test is based on the algorithm published by Mehta (1986) [62]. Note: Comparisons relating to 2 chosen categories can be made using the tests for 2 x 2 contingency tables and the Bonferroni correction [1]. 11.2.6 The Chi-square test and the Fisher test for 2x2 tables (with corrections). These tests are based on data gathered in the form of a contingency table of 2 features (X, Y), each of which has 2 possible categories, $X_1$, $X_2$ and $Y_1$, $Y_2$ (look at the table (11.1)). Basic assumptions: measurement on a nominal scale (dichotomous variables, that is, variables of two categories), an independent model. The additional assumption for the $\chi^2$ test: large expected frequencies (according to the Cochran interpretation (1952) [20], none of the expected frequencies can be < 1 and no more than 20% of the expected frequencies can be < 5). General hypotheses: $H_0: O_{ij} = E_{ij}$ for all categories, $H_1: O_{ij} \ne E_{ij}$ for at least one category, where: $O_{ij}$ are observed frequencies in a contingency table, $E_{ij}$ are expected frequenci
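The expected frequencies, the Cochran condition and the uncorrected chi-square statistic can be sketched as below. Expected frequencies use the standard rule (row total times column total divided by n), which is assumed here because the guide's definition is cut off; the tables are hypothetical.

```python
def expected_frequencies(table):
    """Expected count of each cell: row total * column total / n."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]

def cochran_condition(expected):
    """True if no expected count is < 1 and at most 20% of them are < 5."""
    cells = [e for row in expected for e in row]
    small = sum(1 for e in cells if e < 5)
    return min(cells) >= 1 and small / len(cells) <= 0.2

def chi_square(table):
    """Chi-square statistic without any continuity correction."""
    expected = expected_frequencies(table)
    return sum((o - e) ** 2 / e
               for obs_row, exp_row in zip(table, expected)
               for o, e in zip(obs_row, exp_row))

table = [[20, 10], [10, 20]]  # hypothetical 2 x 2 observed frequencies
print(expected_frequencies(table), cochran_condition(expected_frequencies(table)))
print(chi_square(table))
```

When the Cochran condition fails, as it does for sparse tables, the exact Fisher test is the usual alternative to the chi-square test.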
314. ratio of two odds: the odds that a person from the group of ill people will obtain a negative test result, and the odds that the same effect will be observed among healthy people.
$LR_- = \frac{1 - sensitivity}{specificity} = \frac{FN/(TP+FN)}{TN/(FP+TN)}$
The confidence interval for $LR_-$ is built on the basis of the standard error:
$SE = \sqrt{\frac{sensitivity}{FN} + \frac{1 - specificity}{TN}}$
Accuracy
Accuracy (Acc): the probability of a correct diagnosis using a diagnostic test. If the examined person obtains a positive or a negative test result, Acc informs how sure they can be about the definitive diagnosis.
$Acc = \frac{TP + TN}{n}$
The confidence interval is built on the basis of the Clopper-Pearson method for a single proportion.
The settings window with the diagnostic tests can be opened in Statistics menu → Diagnostic tests → Diagnostic tests.
EXAMPLE 16.1 (mammography.pqs file)
Mammography is one of the most popular screening tests which enables the detection of breast cancer. The following study has been carried out on a group of 250 people (so-called asymptomatic women) aged from 40 to 50. Mammography can detect an outbreak of cancer smaller than 5 mm and enables to note the change which is not a nodu
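The two measures defined above reduce to a few lines of arithmetic on the four cells of the diagnostic table. This is an illustrative sketch only: the function names and the counts (TP, FN, FP, TN) are hypothetical and are not the mammography data of Example 16.1.

```python
def negative_likelihood_ratio(tp, fn, fp, tn):
    """LR- = (1 - sensitivity) / specificity."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (fp + tn)
    return (1 - sensitivity) / specificity


def accuracy(tp, fn, fp, tn):
    """Acc = (TP + TN) / n, the share of correct diagnoses."""
    return (tp + tn) / (tp + fn + fp + tn)


tp, fn, fp, tn = 80, 20, 30, 120  # hypothetical counts
print(negative_likelihood_ratio(tp, fn, fp, tn), accuracy(tp, fn, fp, tn))
```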
315. re than or less than 1/5 of the dinners out of all the dinners served within a week in this canteen.
Report:
Analysed variables: number of served dinners
Significance level: 0.05
Continuity correction: No
Data Filter: day of the week = Friday
Group proportion, Clopper-Pearson (Binomial Exact) 95% CI for the proportion: from 0.083384 to 0.198387
Z statistic: 2.041241
p-value (asymptotic): 0.041227
One-sided p-value (exact), two-sided p-value (exact)
[Figure: group proportion with its 95% confidence interval; axis: Proportion]
The proportion of the distinguished value in the sample is p = 0.133 and the 95% Clopper-Pearson confidence interval for this fraction (0.083, 0.198) does not include the hypothetical value of 0.2.
Based on the Z test without the continuity correction (p-value = 0.041227), and also on the basis of the exact value of the probability calculated from the binomial distribution (p-value = 0.044711), you can assume at the significance level $\alpha = 0.05$ that on Friday statistically fewer than 1/5 of the dinners of the whole week are served. However, after using the continuity correction it is not possible to reject the null hypothesis (p-value = 0.052479).
11 COMPARISON - 2 GROUPS
Interval scale Ordinal scale Are the da
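The asymptotic results of this example can be retraced with a short calculation. The sketch below is not PQStat code. It assumes 20 Friday dinners out of 150 in the week, counts that are not stated in the text and were chosen only because they reproduce the reported proportion 0.133 and Z statistic. It also assumes the continuity correction subtracts 1/(2n) from the absolute difference of proportions.

```python
import math


def one_proportion_z_test(successes, n, p0, continuity=False):
    """Two-sided Z test for one proportion against the hypothetical value p0."""
    p_hat = successes / n
    se = math.sqrt(p0 * (1 - p0) / n)
    diff = p_hat - p0
    if continuity:
        # Assumed correction: shrink |diff| by 1/(2n), never past zero.
        diff = math.copysign(max(abs(diff) - 1 / (2 * n), 0.0), diff)
    z = diff / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # 2 * (1 - Phi(|z|))
    return z, p_value


# Assumed counts: 20 of 150 dinners served on Friday, p0 = 0.2.
z, p = one_proportion_z_test(20, 150, 0.2)
z_cc, p_cc = one_proportion_z_test(20, 150, 0.2, continuity=True)
print(z, p, z_cc, p_cc)
```

Under these assumptions the uncorrected test falls just below 0.05 and the corrected one just above it, which is the borderline situation the example describes. The exact binomial p-value is not computed here.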
316. [Figure: box-and-whiskers plot of waiting days; legend: Median (Me), Min-Max]
Comparing the p-value = 0.123212 of the Wilcoxon test based on the T statistic with the significance level $\alpha = 0.05$, we conclude that there is no reason to reject the null hypothesis, which states that the usual number of days of waiting for a delivery from the analysed courier company is 3. You would make exactly the same decision on the basis of the p-value = 0.111161 or the p-value = 0.115817 of the Wilcoxon test based on the Z statistic or on Z with the correction for continuity.
10.2.3 The Chi-square goodness-of-fit test
The $\chi^2$ goodness-of-fit test is also called the one-sample $\chi^2$ test and is used to test the compatibility of values observed for r (r ≥ 2) categories $X_1, X_2, ..., X_r$ of one feature X with hypothetical expected values for this feature. The values of all n measurements should be gathered in the form of a table consisting of r rows (categories $X_1, X_2, ..., X_r$). For each category $X_i$ there is written the frequency of its occurrence $O_i$ and its expected frequency $E_i$, or the probability of its occurrence $p_i$. The expected frequency is calculated as the product $E_i = n p_i$. The built table can take one of the following forms:
Basic assumptions:
- measurement on a nominal scale (alternatively an ordinal scale or an interval scale),
- large expected frequencies (according to the Cochran interpretation (1952) [20]: none of the
317. one. As a result, the closer a given original variable lies to the rim of the circle, the better the representation of such a variable by the presented principal components.
The sign of the coordinates of the terminal point of the vector, i.e. the sign of the loading factor, points to the positive or negative correlation of an original variable and the principal components forming the coordinate system. If we consider both axes (2 components) together, then original variables can fall into one of four categories, depending on the combination of the signs of their loading factors.
The angle between vectors indicates the correlation of original values:
- $0 < \alpha < 90°$: the smaller the angle between the vectors representing original variables, the stronger the positive correlation among these variables,
- $\alpha = 90°$: the vectors are perpendicular, which means that the original variables are not correlated,
- $90° < \alpha < 180°$: the greater the angle between the vectors representing the original variables, the stronger the negative correlation among these variables.
Biplot
The graph presents 2 series of data placed in a coordinate system defined by 2 principal components. The first series on the graph are data from the first graph (i.e. the vectors of original variables) and the second series are points presenting particular cases.
[Figure: biplot; axis: factor 1]
318. rmination coefficient: it falls within the range $\langle 0; 1\rangle$ and defines the relation of only the variance of the given independent variable $X_i$ with the complete variance of the dependent variable Y. The closer the value of those coefficients to 0, the more useless the information carried by the studied variable, which means the variable is superfluous.
- R squared ($R^2 \in \langle 0; 1\rangle$): it represents the percentage of variance of the given independent variable $X_i$ explained by the remaining independent variables. The closer the value is to 1, the stronger the linear relation of the studied variable with the remaining independent variables, which can mean that the variable is a superfluous one.
- Tolerance $= 1 - R^2 \in \langle 0; 1\rangle$: it represents the percentage of variance of the given independent variable $X_i$ NOT explained by the remaining independent variables. The closer the value of tolerance is to 0, the stronger the linear relation of the studied variable with the remaining independent variables, which can mean that the variable is a superfluous one.
- A comparison of a full model with a model in which a given variable is removed. The comparison of the two models is made by means of:
  - the F test, in a situation in which one variable or more are removed from the model (see the comparison of models),
  - the t test, when only one variable is removed from the model. It is the same test that is used for studying the significance of particular
319. rs. There is a Message bar at the top of each datasheet. The message bar displays all current information. The left side of the bar gives the dimensions of the selected area (the number of rows and columns), the centre part of the bar displays the value of the selected cell, and the right side of the bar gives information mainly about a statistical analysis which is in progress at that moment.
[Figure: PQStat v. 1.4.0 main window with the Project Manager and a datasheet]
3.1.4 CELLS FORMAT
Each datasheet cell, including the column heading, can contain a maximum of 40 characters. Texts containing national characters are also allowed. The introduced values can be formatted as:
- default: in the case of the default format, the program automatically recognizes the content of a cell with regard to numerical and text data,
- text: in the case of the text format, the data are interpreted as text (alignment to the left edge of the cell),
- date: in the case of the da
320. s 1 0002782725 2 241246614 1
As expected, statistically significant differences only concern the survival curves of the youngest and oldest groups.
19.4 PROPORTIONAL COX HAZARD REGRESSION
The window with settings for Cox regression is accessed via the menu Statistics → Survival analysis → PH Cox regression.
[Figure: Cox PH Regression settings window, with the survival time variable (survival time, weeks), the status variable (0 = censored, 1 = relapse), the independent variables (Log WBC, Rx), the data filter and the report options]
Cox regression, also known as the Cox proportional hazard model, is the most popular regressive method for survival analysis. It allows the study of the impact of many independent variables ($X_1, X_2, ..., X_k$) on survival rates. The approach is, in a way, non-parametric and thus encumbered with few assumptions, which is why it is so popular. The nature or shape of the hazard function does not have to be known, and the only condition is t
321. s (frequency in particular measurements is always the same).
(i) The value of the critical difference is calculated by using the following formula:
$CD = \sqrt{F_{\alpha, df_{BC}, df_{res}}} \cdot \sqrt{(k-1) \left( \sum_{j=1}^{k} \frac{c_j^2}{n} \right) MS_{res}}$
where $F_{\alpha, df_{BC}, df_{res}}$ is the critical value (statistic) of the F Snedecor distribution for a given significance level $\alpha$ and $df_{BC}$ and $df_{res}$ degrees of freedom.
(ii) The test statistic is defined by:
$F = \frac{\left( \sum_{j=1}^{k} c_j \bar{x}_j \right)^2}{(k-1) \left( \sum_{j=1}^{k} \frac{c_j^2}{n} \right) MS_{res}}$
The test statistic has the F Snedecor distribution with $df_{BC}$ and $df_{res}$ degrees of freedom.
The Tukey test
For simple comparisons (frequency in particular measurements is always the same).
(i) The value of the critical difference is calculated by using the following formula:
$CD = \frac{\sqrt{2} \cdot q_{\alpha, df_{res}, k} \cdot \sqrt{\left( \sum_{j=1}^{k} \frac{c_j^2}{n} \right) MS_{res}}}{2}$
where $q_{\alpha, df_{res}, k}$ is the critical value (statistic) of the studentized range distribution for a given significance level $\alpha$ and $df_{res}$ and k degrees of freedom.
(ii) The test statistic is defined by:
$q = \sqrt{2} \cdot \frac{\sum_{j=1}^{k} c_j \bar{x}_j}{\sqrt{\left( \sum_{j=1}^{k} \frac{c_j^2}{n} \right) MS_{res}}}$
The test statistic has the studentized range distribution with $df_{res}$ and k degrees of freedom.
Info. The algorithm for calculating the p-value and the statistic of the studentized range distribution in PQStat is based on the work of Lund (1983) [54]. Other applications or web pages may calculate a little bit different value
322. s 3 equivalent hypotheses:
$H_0$: all $\beta_i = 0$, $H_1$: there exists $\beta_i \neq 0$;
$H_0$: $R^2 = 0$, $H_1$: $R^2 \neq 0$;
$H_0$: a lack of a linear relation, $H_1$: a linear relation exists.
The test statistic has the form presented below:
$F = \frac{E_{MS}}{R_{MS}}$
where:
$E_{MS} = \frac{E_{SS}}{df_E}$ is the mean square explained by the model,
$R_{MS} = \frac{R_{SS}}{df_R}$ is the residual mean square,
$df_E = k$, $df_R = n - k - 1$ are the appropriate degrees of freedom.
The statistic follows the F Snedecor distribution with $df_E$ and $df_R$ degrees of freedom.
The p-value, designated on the basis of the test statistic, is compared with the significance level $\alpha$:
if $p \le \alpha$ ⟹ reject $H_0$ and accept $H_1$,
if $p > \alpha$ ⟹ there is no reason to reject $H_0$.
17.2.2 More information about the variables in the model
- Standardized $b_1, b_2, ..., b_k$: in contrast to raw parameters (which are expressed in different units of measure, depending on the described variable, and are not directly comparable), the standardized estimates of the parameters of the model allow the comparison of the contribution of particular variables to the explanation of the variance of the dependent variable Y.
- Correlation matrix: contains information about the strength of the relation between particular variables, that is the Pearson's correlation coefficient $r_p \in \langle -1; 1\rangle$. The coefficient is used for the study of the correlation of each pair of variables, wit
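The F statistic described above is a ratio of two mean squares. The sketch below only illustrates that arithmetic: the function name and the sums of squares are hypothetical, and the p-value, which requires the F distribution, is left out.

```python
def regression_f_statistic(ss_explained, ss_residual, n, k):
    """F = E_MS / R_MS with df_E = k and df_R = n - k - 1."""
    df_e = k
    df_r = n - k - 1
    ms_explained = ss_explained / df_e   # E_MS
    ms_residual = ss_residual / df_r     # R_MS
    return ms_explained / ms_residual, df_e, df_r


# Hypothetical model: n = 44 cases, k = 3 independent variables.
f, df_e, df_r = regression_f_statistic(ss_explained=120.0, ss_residual=80.0, n=44, k=3)
print(f, df_e, df_r)
```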
323. s also called statistical distance. It is weighted by the covariance matrix, which allows the comparison of objects described by mutually correlated features. The use of the Mahalanobis distance has two basic advantages:
1) The variables for which greater deviations or value ranges are observed do not have an increased influence on the result of the Mahalanobis distance, because when we use a covariance matrix we standardize the variables with the use of the variance on the diagonal. As a result, one does not have to standardize/normalize the variables before starting the analysis.
2) It takes into account the mutual correlation of the features describing the compared objects: when we use a covariance matrix, we use the information about the dependency among the features, which is placed beyond the diagonal of the matrix.
$d(x_1, x_2) = \sqrt{(x_1 - x_2)^T S^{-1} (x_1 - x_2)}$
The measure calculated in that manner fulfills the requirements of being a metric.
Cosine
The cosine distance ought to be calculated on positive data because it is not a metric (it does not fulfill the first condition: $d(x_1, x_2) \ge 0$). If there are characteristics which also have negative values, we should transform them in advance, for example with the use of normalization to a range of positive numbers. The advantage of that distance is that (for positive arguments) it is limited to the range $\langle 0, 1\rangle$. A similarity of two objects is represented by the angle between the two
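Both distances can be illustrated for objects described by two features. This is a sketch under stated assumptions, not PQStat's implementation: the function names are hypothetical, the covariance matrix S is restricted to 2x2 so that it can be inverted by hand, and the cosine distance is assumed to be one minus the cosine of the angle between the two vectors.

```python
import math


def mahalanobis_2d(x, y, cov):
    """sqrt((x - y)^T S^-1 (x - y)) for two features and a 2x2 covariance matrix S."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    inv = [[d / det, -b / det], [-c / det, a / det]]  # closed-form 2x2 inverse
    diff = [x[0] - y[0], x[1] - y[1]]
    q = sum(diff[i] * inv[i][j] * diff[j] for i in range(2) for j in range(2))
    return math.sqrt(q)


def cosine_distance(x, y):
    """Assumed definition: 1 - cos(angle between x and y)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return 1 - dot / norm


# With an identity covariance matrix the Mahalanobis distance is Euclidean.
print(mahalanobis_2d((0, 0), (3, 4), [[1, 0], [0, 1]]))
# A larger variance of the first feature shrinks its contribution.
print(mahalanobis_2d((2, 0), (0, 0), [[4, 0], [0, 1]]))
print(cosine_distance((1, 0), (0, 1)))
```

The second call shows the first advantage listed above: a feature with variance 4 contributes a difference of 2 as a distance of only 1.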
324. s like this:
          higher    primary   secondary
female    20.59%    11.76%    11.76%
male      17.65%    23.53%    14.71%
We can distinguish 2 approaches to the analysed contingency tables. We can analyse the independence between both features or their homogeneity, which means checking if there are any differences between the distribution of the first feature (variable) and the second one. Although these approaches sound different, they both lead to the same calculations.
11.2.4 The Chi-square test for trend for Rx2 tables
The $\chi^2$ test for trend is used to determine whether there is a trend in proportion for particular categories of the analysed variables (features). It is based on the data gathered in the contingency table of 2 features. The first feature has r possible ordered categories $X_1, X_2, ..., X_r$ and the second one has 2 categories $G_1, G_2$ (table 11.3).
[Table 11.3: the contingency table of r x 2 observed frequencies $O_{ij}$, with rows $X_1, ..., X_r$ (feature 1, feature X), columns $G_1, G_2$ (feature 2) and row totals $W_i$]
Basic assumptions:
- measurement on an ordinal scale or on an interval scale,
- an independent model (the second feature: 2 independent groups).
Hypotheses:
$H_0$: in the analysed population the trend in a proportion of $p_1, p_2, ..., p_r$ does not exist,
$H_1$: there is a trend in a proportion of $p_1, p_2, ..., p_r$ in the analysed population,
where $p_1, p_2, ..., p_r$ are the proportions $p_1 = \frac{O_{11}}{W_1}, p_2 = \frac{O_{21}}{W_2}, ..., p_r = \frac{O_{r1}}{W_r}$
325. Report:
Analysed variables: age, company
Significance level: 0.05
Grouping variable: company (transport company)
Group name                       transport company1   transport company2   transport company3
Group size                       50                   50                   50
Group mean                       30.26                32.68                33.98
Group standard deviation         5.23259              6.358154             5.482775
Std err of the group mean        0.74                 0.899179             0.775381
-95% CI for the group mean       28.772914            30.873033            32.421813
+95% CI for the group mean       31.747086            34.486967            35.538187
Total sum of squares (SS T): 5151.893333
Between groups sum of squares (SS BG): 356.413333
Within groups sum of squares (SS WG): 4795.48
Mean square between groups (MS BG): 178.206667
Mean square within groups (MS WG): 32.622313
Between groups degrees of freedom (df BG): 2
Within groups degrees of freedom (df WG), total degrees of freedom (df T), F statistic, p-value
Comparing the p-value = 0.005147 of the one-way analysis of variance with the significance level $\alpha = 0.05$, you can draw the conclusion that the average ages of workers of these transport companies are not the same. Based just on the ANOVA result, you do not know precisely which groups differ fr
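The sums of squares in this report can be recomputed from the group sizes, means and standard deviations alone. The sketch below is an illustration, not PQStat's procedure. The function name is hypothetical, and the F statistic it returns is simply MS BG divided by MS WG (the p-value is not computed).

```python
def one_way_anova_from_summary(groups):
    """groups: list of (size, mean, standard deviation) tuples, one per group."""
    n_total = sum(n for n, _, _ in groups)
    grand_mean = sum(n * m for n, m, _ in groups) / n_total
    ss_bg = sum(n * (m - grand_mean) ** 2 for n, m, _ in groups)  # between groups
    ss_wg = sum((n - 1) * sd ** 2 for n, _, sd in groups)         # within groups
    df_bg = len(groups) - 1
    df_wg = n_total - len(groups)
    f = (ss_bg / df_bg) / (ss_wg / df_wg)
    return ss_bg, ss_wg, df_bg, df_wg, f


# Group summaries as listed in the report above.
groups = [(50, 30.26, 5.23259), (50, 32.68, 6.358154), (50, 33.98, 5.482775)]
ss_bg, ss_wg, df_bg, df_wg, f = one_way_anova_from_summary(groups)
print(ss_bg, ss_wg, df_bg, df_wg, f)
```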
326. s than PQStat, because they may be based on less precise or more restrictive algorithms (Copenhaver and Holland (1988), Gleason (1999)).
The settings window with the Single-factor repeated-measures ANOVA can be opened in Statistics menu → Parametric tests → ANOVA for dependent groups, or in Wizard.
[Figure: settings window of the Single-factor repeated-measures ANOVA, with the variables Glucose test I to Glucose test IV, the POST-HOC option (Fisher LSD), the data filter and the report options]
12.2 NONPARAMETRIC TESTS
12.2.1 The Kruskal-Wallis ANOVA
The Kruskal-Wallis one-way analysis of variance by ranks (Kruskal 1952 [46]; Kruskal and Wallis 1952 [47]) is an extension of the Mann-Whitney U test to more than two populations. This test is used to verify the hypothesis that the differences between medians of the analysed variable in k ≥ 2 populations are insignificant (but you need to assume that the variable distributions are similar).
Basic assumptions:
- measurement on an ordinal scale or on an interv
327. se expected frequencies can be < 1 and no more than 20% of the expected frequencies can be < 5),
- the total of the observed frequencies should be exactly the same as the total of the expected frequencies, and the total of all $p_i$ probabilities should come to 1.
Hypotheses:
$H_0: O_i = E_i$ for all categories,
$H_1: O_i \neq E_i$ for at least one category.
The test statistic is defined by:
$\chi^2 = \sum_{i=1}^{r} \frac{(O_i - E_i)^2}{E_i}$
This statistic asymptotically (for large expected frequencies) has the $\chi^2$ distribution with the number of degrees of freedom calculated using the formula $df = r - 1$.
The p-value, designated on the basis of the test statistic, is compared with the significance level $\alpha$:
if $p \le \alpha$ ⟹ reject $H_0$ and accept $H_1$,
if $p > \alpha$ ⟹ there is no reason to reject $H_0$.
The settings window with the Chi-square test (goodness-of-fit) can be opened in Statistics menu → NonParametric tests (unordered categories) → Chi-square, or in Wizard.
[Figure: settings window of the Chi-square test (goodness-of-fit), with the variables day of the week, number of served dinners (observed frequencies) and expected number of served dinners, the data filter and the report options]
EXAMPLE 10
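The statistic and its degrees of freedom can be sketched directly from the definitions above. The code is illustrative only: the function name and the observed counts are hypothetical (they are not the canteen data of the example), and the p-value is omitted because it requires the $\chi^2$ distribution.

```python
def chi_square_goodness_of_fit(observed, probabilities):
    """Return (statistic, df, expected) with E_i = n * p_i and df = r - 1."""
    if abs(sum(probabilities) - 1.0) > 1e-9:
        raise ValueError("the probabilities p_i must sum to 1")
    n = sum(observed)
    expected = [n * p for p in probabilities]
    statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return statistic, len(observed) - 1, expected


observed = [18, 22, 20, 25, 15]  # hypothetical counts for r = 5 categories
stat, df, expected = chi_square_goodness_of_fit(observed, [0.2] * 5)
print(stat, df, expected)
```

Because the expected frequencies are built as $n p_i$, their total equals the total of the observed frequencies, as the assumptions require.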
328. se the module Dependent ROC Curves - comparison, described in Chapter
- Automatic model comparison
In the case of automatic model comparison we receive very similar results. The best model is the one constructed on the basis of the independent variables: AGE, EDUCATION, TIME needed for the completion of the task, DISTURBANCES.
On the basis of the analyses above, from the statistical point of view, the optimum model is a model with the 4 most important independent variables: AGE, EDUCATION, TIME needed for the completion of the task, DISTURBANCES. An exact analysis can be made in the module Logistic Regression. However, the ultimate decision about which model to choose is up to the experiment maker.
18 DIMENSION REDUCTION AND GROUPING
As the number of variables subjected to a statistical analysis grows, the precision of the analysis grows, but so does the level of complexity and difficulty in interpreting the obtained results. Too many variables increase the risk of their mutual correlation. The information carried by some variables can then be redundant, i.e. a part of the variables may not bring in new information for analysis but repeat the information already given by other variables. The need for dimension reduction (a reduction of the number of variables) has inspired a whole group of analyses devoted to that issue, such as factor analysis, princi
329. sed variable are made several times (k ≥ 2), each time in different conditions. It is also used when we have rankings coming from different sources (from different judges) and concerning a few (k ≥ 2) objects, and we want to assess the degree of agreement of the rankings.
12.2.2 The Friedman ANOVA
Basic assumptions:
- measurement on an ordinal scale or on an interval scale,
- a dependent model.
Hypotheses:
$H_0: \theta_1 = \theta_2 = ... = \theta_k$,
$H_1$: not all $\theta_j$ are equal (j = 1, 2, ..., k),
where $\theta_1, \theta_2, ..., \theta_k$ are the medians of the analysed feature in the following measurements from the examined population.
The test statistic is defined by:
$T_1 = \frac{1}{C} \left( \frac{12}{nk(k+1)} \sum_{j=1}^{k} \left( \sum_{i=1}^{n} R_{ij} \right)^2 - 3n(k+1) \right)$
where:
n is the sample size,
$R_{ij}$ are the ranks ascribed to the following measurements (j = 1, 2, ..., k), separately for the analysed objects (i = 1, 2, ..., n),
$C = 1 - \frac{\sum (t^3 - t)}{n(k^3 - k)}$ is the correction for ties,
t is the number of cases included in a tie.
The formula for the test statistic $T_1$ includes the correction for ties C. This correction is used when ties occur (if there are no ties, the correction is not calculated because C = 1).
The $T_1$ statistic asymptotically (for large sample sizes) has the $\chi^2$ distribution with the number of degrees of freedom calculated using the formula $df = k - 1$.
The p-value, designated on the basis of the test statistic, is compared with the significance level $\alpha$:
if $p \le \alpha$ ⟹ reject $H_0$ and accept $H_1$,
if p
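The Friedman statistic with the correction for ties can be sketched as follows. This is an illustration rather than PQStat's implementation: the helper names and the toy data are hypothetical, the ranks are assigned within each object (row), tied values receive the mean of their positions, and the correction is assumed to be $C = 1 - \sum (t^3 - t) / (n(k^3 - k))$.

```python
def average_ranks(values):
    """Rank values ascending; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    tie_sizes = []
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1
        for i in order[pos:end + 1]:
            ranks[i] = mean_rank
        tie_sizes.append(end - pos + 1)
        pos = end + 1
    return ranks, tie_sizes


def friedman_statistic(rows):
    """rows: n objects, each with k repeated measurements. Returns (statistic, df)."""
    n, k = len(rows), len(rows[0])
    rank_sums = [0.0] * k
    tie_term = 0
    for row in rows:
        ranks, ties = average_ranks(row)
        for j, r in enumerate(ranks):
            rank_sums[j] += r
        tie_term += sum(t ** 3 - t for t in ties)
    c = 1 - tie_term / (n * (k ** 3 - k))  # equals 1 when there are no ties
    raw = 12 / (n * k * (k + 1)) * sum(r ** 2 for r in rank_sums) - 3 * n * (k + 1)
    return raw / c, k - 1


print(friedman_statistic([[1, 2, 3], [1, 2, 3], [1, 2, 3]]))  # no ties
print(friedman_statistic([[1, 1, 2], [1, 2, 3]]))             # one tie of size 2
```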
330. [Figure: Schoenfeld residuals for the Rx variable plotted against time]
We do not observe any outliers; however, the martingale and deviance residuals become lower the longer the time. Schoenfeld residuals have a symmetrical distribution with respect to time. In their case the analysis of the graph can be supported with various tests which can evaluate if the points of the residual graph are distributed in a certain pattern, e.g. a linear dependency. In order to make such an analysis, we have to copy the Schoenfeld residuals together with time into a datasheet and test the type of dependence which we are looking for. The result of such a test for each variable signifies if the assumption of hazard proportionality by a variable in the model has been fulfilled. It has been fulfilled if the result is statistically insignificant, and it has not been fulfilled if the result is statistically significant. As a result, the variable which does not fulfill the proportional hazard assumption of the Cox regression can be excluded from the model. In the case of the Log WBC and Rx variables, the symmetrical distribution of the residuals suggests that the assumption of hazard proportionality is fulfilled by those variables. That can be confirmed by checking the correlation (e.g. Pearson's linear or Spearman's monotonic) between those residuals and time.
Later we can add the sex variable to the model. However, we have to act with caution because we
331. set the rule for the sex variable, we select the button again and set the rule for the age variable. Remember that in order to do the exercise correctly, all filter conditions should be connected with the conjunction (a sign in the window informs us about it). We select the option Deactivate and confirm these analysis conditions by clicking the OK button.
When narrowing down the workspace in the datasheet, we should remember that the filter conditions can be connected with the conjunction or with the alternative. The change between the alternative and the conjunction is made with the corresponding buttons.
To activate all cases, one should select the menu Edit → Activate all.
2. You can select a coherent area. This causes the chosen analysis to be performed using only the selected rows and columns, which include the necessary data.
EXAMPLE 4.4 (filter.pqs file)
You want to calculate descriptive statistics for the height of girls who are between 10 and 15 years old. In order to calculate this, you need to sort the data according to the sex and age columns, then select the coherent area of the column which includes the height of the 10 to 15 year old girls, and select Descriptive statistics from the Statistics menu.
[Figure: PQStat window with the EN_filter.pqs file open]
332. [Table: frequency table of the variable amount of used free minutes, with the classes 130-135, 135-140, 140-145, 145-150, 150-155, 155-160, 160-165, 165-170, 170-175, 175-180, 180-185, 185-190 and the columns: frequency, cumulative frequency, percent, cumulative percent]
A graphical presentation of results included in a table is usually done using a histogram or a bar plot.
[Figure: histogram; vertical axis: Percent; horizontal axis: amount of used free minutes, classes from 130-135 to 185-190]
Such a graph can be created by selecting the Add graph option in the Frequency tables window.
Theoretical data distribution, which is also called a probability distribution, is usually presented graphically by means of a line graph. Such a line is described by a function (mathematical model) and it is called a density function. You can replace the empirical distribution with the adequate theoretical distribution.
Note. To replace an empirical distribution with the adequate theoretical distribution, it is not enough to draw conclusions intuitively from the similarity of their shapes. To check it, you should use specially created compatibility tests.
The kind of probability distribution which is used the most often is the normal distribution (Gaussian distribution). Such a distribution, with a mean of 161.15 and a standard devia
333. square, OR/RR (2x2), Fisher, Mid-P (2x2), Chi-square (RxC), Fisher (RxC), Chi-square (multidimensional), measures of correlation: Q-Yule, Phi (2x2), C-Pearson, V-Cramer (RxC), measures of agreement: Kappa-Cohen
- Diagnostic tests: Diagnostic tests, ROC Curve, Dependent ROC Curves - comparison, Independent ROC Curves - comparison
- Multivariate models: Multiple regression, Multiple regression - Comparing models, Logistic regression, Logistic regression - Comparing models, Principal Component Analysis
- Stratified analysis: Mantel-Haenszel OR/RR
- Survival analysis: Life tables, Kaplan-Meier Analysis, Comparison groups, Cox PH regression, Cox PH regression - Comparing models
- Scale Reliability
- Wizard
Menu Spatial Analysis (description in User Guide - PQStat for Spatial Analysis)
- Map Manager
- Tools: Geometry calculations, Spatial weights matrix, Spatial descriptive statistics
- Spatial Statistics: Nearest Neighbour Analysis, Global Moran's statistic, Global Geary's C, Local Moran's statistic, Local Getis-Ord Gi statistic
Menu Graphs
- Histogram, Box-Whiskers plot, Error plot, Scatter plot, Line plot
4 HOW TO ORGANISE WORK WITH PQSTAT
All statistical analysis procedures are available in the Statistics menu. 4
334. ssed via the menu Statistics → Diagnostic tests → Independent ROC Curves - comparison.
[Figure: settings window of the Independent ROC curves comparison, with the diagnostic variables (WBC, PCT), the state variable, the test options, the data filter and the report options]
EXAMPLE 16.2 (c.d.) (bacteriemia.pqs file)
We will make 2 comparisons:
1) We will construct 2 ROC curves to compare the diagnostic value of the parameters WBC and PCT.
2) We will construct 2 ROC curves to compare the diagnostic value of the PCT parameter for boys and girls.
Ad 1) Both parameters, WBC and PCT, are stimulants (in bacteremia their values are high). In the course of the comparison of the diagnostic value of those parameters we verify the following hypotheses:
$H_0$: the area under the ROC curve for WBC = the area under the ROC curve for PCT,
$H_1$: the area under the ROC curve for WBC ≠ the area under the ROC curve for PCT.
Report: Analysed variables, Count of missing data, Significance level, Grouping variable, Size, Size (STATE = yes), Size (STATE = no); Variable WBC: Direction of diagnostic variable, AUC, SE(AUC), -95% CI, +95% CI; Vari
335. ssion model, then their $b_1, b_2, ..., b_{k-1}$ coefficients will be calculated:
$b_1$ is the reference of the Y results (for codes 1 in $X_1$) to the unweighted general mean (corrected by the remaining variables in the model),
$b_2$ is the reference of the Y results (for codes 1 in $X_2$) to the unweighted general mean (corrected by the remaining variables in the model),
...
$b_{k-1}$ is the reference of the Y results (for codes 1 in $X_{k-1}$) to the unweighted general mean (corrected by the remaining variables in the model).
Example
With the use of effect coding we will code the sex variable with two categories (the male category will be the base category), and a variable informing about the region of residence in the analyzed country. 5 regions were selected: northern, southern, eastern, western and central. The central region will be the base one.
Regions     Coded regions
            western   eastern   northern   southern
central       -1        -1        -1         -1
western        1         0         0          0
eastern        0         1         0          0
northern       0         0         1          0
southern       0         0         0          1
Building on the basis of dummy variables, in a multiple regression model, we might want to check what impact the variables have on a dependent variable, e.g. Y = the amount of earnings (expressed in thousands of PLN). As a result of such an analysis we will obtain sample coefficients for each dum
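The effect coding of the region variable described in the example can be sketched as a small helper. The function name and the sample values are hypothetical. The rule it applies is the one stated above: the base category receives -1 in every coded column, and every other category receives 1 in its own column and 0 elsewhere.

```python
def effect_code(values, categories, base):
    """Return (column names, coded rows) for a categorical variable with a base category."""
    coded_names = [c for c in categories if c != base]
    rows = []
    for v in values:
        if v == base:
            rows.append([-1] * len(coded_names))  # base category: -1 everywhere
        else:
            rows.append([1 if v == c else 0 for c in coded_names])
    return coded_names, rows


regions = ["western", "eastern", "northern", "southern", "central"]
names, coded = effect_code(["central", "western", "southern"], regions, base="central")
print(names)
print(coded)
```

A variable with k categories yields k - 1 coded columns, which matches the $b_1, ..., b_{k-1}$ coefficients listed above.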
336. st statistics, the p-value is estimated and then compared with $\alpha$:
if $p \le \alpha$ ⟹ we reject $H_0$ and accept $H_1$,
if $p > \alpha$ ⟹ there is no reason to reject $H_0$.
We make the decision about which model to choose on the basis of the size of $R^2_{Pseudo}$, $R^2_{Nagelkerke}$, $R^2_{Cox-Snell}$ and the result of the Likelihood Ratio test, which compares the subsequently created (neighboring) models. If the compared models do not differ significantly, we should select the one with the smaller number of variables. This is because a lack of a difference means that the variables present in the full model but absent in the reduced model do not carry significant information. However, if the difference is statistically significant, it means that one of them (the one with the greater number of variables, with a greater $R^2$) is significantly better than the other one.
In the program PQStat the comparison of models can be done manually or automatically.
- Manual model comparison: construction of 2 models:
  - a full model: a model with a greater number of variables,
  - a reduced model: a model with a smaller number of variables; such a model is created from the full model by removing those variables which are superfluous from the perspective of studying a given phenomenon.
The choice of independent variables in the compared models and, subsequently, the choice of a better mo
337. stic for the Odds Ratio: 3.955391, p-value: 0.000076
Relative Risk: 2.222222
-95% CI for the Relative Risk: 1.456985
+95% CI for the Relative Risk: 3.389378
Statistic for the Relative Risk: 3.707423, p-value: 0.000209
Chi-square statistic: 16.325397, degrees of freedom: 1, p-value: 0.000053
Statistic with Yates correction: 15.088259, degrees of freedom: 1, p-value with Yates correction: 0.000103
[Figure: bar plot of the contingency table, in percent]
The expected frequency table does not contain any values less than 5. The p-value = 0.000053. So, at the significance level $\alpha = 0.05$, we can accept the alternative hypothesis, informing us that there is a dependence between sex and exam passing in the analysed population. The exam is passed significantly more often by women (55.56% of all the women in the sample passed the exam) than by men (25.00% of all the men in the sample passed the exam).
The Chi-square test with the Yates correction for continuity
The $\chi^2$ test with the Yates correction (Frank Yates (1934) [87]) is a more conservative test than the $\chi^2$ test (it rejects a null hypothesis more rarely than the $\chi^2$ test). The correction for continuity guarantees the possibility of taking in all the values of real numbers by a test statistic
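The two $\chi^2$ statistics of this report can be retraced with the shortcut formula for 2x2 tables. The sketch is not PQStat code. The counts (50 women and 20 men who passed, 40 women and 60 men who did not) are an assumption: they are not legible in the text and were reconstructed because they match the reported percentages (55.56% and 25.00%) and both statistics.

```python
def chi_square_2x2(table, yates=False):
    """Chi-square statistic for a 2x2 table, optionally with the Yates correction."""
    (a, b), (c, d) = table
    n = a + b + c + d
    diff = abs(a * d - b * c)
    if yates:
        diff = max(diff - n / 2, 0)  # continuity correction, never below zero
    return n * diff ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


# Assumed counts: rows = passed / not passed, columns = women / men.
table = [[50, 20], [40, 60]]
chi2 = chi_square_2x2(table)
chi2_yates = chi_square_2x2(table, yates=True)
print(chi2, chi2_yates)
```

The corrected statistic is always the smaller of the two, which is why the corrected test rejects the null hypothesis more rarely.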
338. stions of assumptions which should be checked before the choice of a particular statistical test. The last step of the wizard is to select an appropriate statistical test and to open the window with the settings of the test options. The Wizard may be launched with the Statistics Wizard button on the toolbar. The launched wizard window lets the user choose the kind of analysis to carry out. A user may choose: Comparison (1 group), to compare values of measurements coming from 1 population with a specific value given by the user. This population is represented by raw data gathered in 1 column or cumulated to the form of a frequency table. Comparison (2 groups), to compare values of measurements coming from 2 populations. These populations are represented by raw data gathered in 2 columns or cumulated to the form of a contingency table. Comparison (more than 2 groups), to compare values of measurements coming from several populations. The populations are represented by raw data collected in several columns. Correlation, to check the occurrence of dependence between 2 parameters coming from 1 population. These features are represented by raw data gathered in 2 columns or cumulated to the form of a contingency table. Agreement, to check the concordance of obtained measurements. These features are represented by raw data gathered in several columns or cumulated to the form of a co
339. surement on a nominal scale (alternatively an ordinal scale or an interval scale); large expected frequencies (according to the Cochran interpretation (1952) [20], none of these expected frequencies can be < 1 and no more than 20% of the expected frequencies can be < 5); an independent model. Hypotheses: $H_0: O_{ij\ldots} = E_{ij\ldots}$ for all categories, $H_1: O_{ij\ldots} \ne E_{ij\ldots}$ for at least one category, where $O_{ij\ldots}$ and $E_{ij\ldots}$ are the observed frequencies in a contingency table and the corresponding expected frequencies. The test statistic is defined by $\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \cdots \frac{(O_{ij\ldots} - E_{ij\ldots})^2}{E_{ij\ldots}}$. This statistic asymptotically (for large expected frequencies) has the $\chi^2$ distribution with a number of degrees of freedom calculated using the formula $df = (r-1)(c-1)(l-1) + (r-1)(c-1) + (r-1)(l-1) + (c-1)(l-1)$ for 3-dimensional tables. The p value, designated on the basis of the test statistic, is compared with the significance level $\alpha$: if $p \le \alpha$, reject $H_0$ and accept $H_1$; if $p > \alpha$, there is no reason to reject $H_0$. The settings window with the Chi-square (multidimensional) test can be opened in Statistics menu → NonParametric tests (unordered categories) → Chi-square (multidimensional) or in Wizard. [Dialog: Chi-square (multidimensional)] Statistical analysis Multidimensional Chi squ
340. [Flowchart: selection of a test for the comparison of 2 groups. Decision questions: Are the data normally distributed (Kolmogorov-Smirnov or Lilliefors test)? Are the data dependent? Are the variances equal (Fisher-Snedecor test)? Tests named: Wilcoxon test for dependent groups, Mann-Whitney test, $\chi^2$ test for trend, t-test for dependent groups, t-test for independent groups, t-test with Cochran-Cox adjustment. Nominal scale: Bowker-McNemar, Z test for 2 proportions, $\chi^2$ test (R x C, 2 x 2), Fisher test (R x C), Fisher test and mid-p (2 x 2).] 11.1 PARAMETRIC TESTS. 11.1.1 The Fisher-Snedecor test. The F-Snedecor test is based on a variable F which was formulated by Fisher (1924), and its distribution was described by Snedecor. This test is used to verify the hypothesis about equality of variances of an analysed variable for 2 populations. Basic assumptions: measurement on an interval scale; normality of distribution of an analysed feature in both populations; an independent model. Hypotheses: $H_0: \sigma_1^2 = \sigma_2^2$, $H_1: \sigma_1^2 \ne \sigma_2^2$, where $\sigma_1^2$, $\sigma_2^2$ are variances of an analysed variable of the 1st and the 2nd population. The test statistic is defined by $F = \frac{sd_1^2}{sd_2^2}$, where $sd_1^2$, $sd_2^2$ are variances of an analysed variable of
341. tatistics p value is estimated and then compared with the significance level $\alpha$: if $p \le \alpha$, we reject $H_0$ and accept $H_1$; if $p > \alpha$, there is no reason to reject $H_0$. Additionally, for the ROC curve the suggested value of the cut-off point of the predicted probability is given, together with the table of sensitivity and specificity for each possible cut-off point. Note: More possibilities of calculating a cut-off point are offered by the module ROC curve. The analysis is made on the basis of observed values and predicted probability obtained in the analysis of Logistic Regression. Classification: On the basis of the selected cut-off point of predicted probability we can change the classification quality. By default the cut-off point has the value of 0.5. The user can change the value into any value from the range (0, 1), e.g. the value suggested by the ROC curve. As a result we shall obtain the classification table and the percentage of properly classified cases, the percentage of properly classified 0 (specificity) and the percentage of properly classified 1 (sensitivity). Prediction on the basis of the model: On the basis of a selected cut-off point of predicted probability and of the given values of independent variables we can calculate the predicted value of the dependent variable (0 or 1). By default the cut-off point has t
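How the cut-off point turns predicted probabilities into a classification table can be sketched as follows. This is a minimal illustration, not PQStat's implementation: the function name `classification_summary`, the sample data, and the choice to classify a probability equal to the cut-off as 1 are all assumptions.

```python
def classification_summary(observed, probabilities, cutoff=0.5):
    """Classify cases by a probability cut-off and summarise the result.

    Assumption: a case is assigned to class 1 when probability >= cutoff.
    """
    predicted = [1 if p >= cutoff else 0 for p in probabilities]
    pairs = list(zip(observed, predicted))
    tp = sum(1 for o, y in pairs if o == 1 and y == 1)
    tn = sum(1 for o, y in pairs if o == 0 and y == 0)
    fp = sum(1 for o, y in pairs if o == 0 and y == 1)
    fn = sum(1 for o, y in pairs if o == 1 and y == 0)
    return {
        "table": {"TP": tp, "FP": fp, "FN": fn, "TN": tn},
        "sensitivity": tp / (tp + fn),    # properly classified 1
        "specificity": tn / (tn + fp),    # properly classified 0
        "accuracy": (tp + tn) / len(pairs),
    }

# Hypothetical observed classes and predicted probabilities
observed = [1, 1, 1, 0, 0, 0]
probabilities = [0.9, 0.6, 0.4, 0.7, 0.2, 0.1]
default = classification_summary(observed, probabilities)          # cut-off 0.5
lowered = classification_summary(observed, probabilities, 0.3)     # cut-off 0.3
```

Lowering the cut-off in this toy data raises sensitivity without changing specificity, which illustrates why the choice of cut-off changes the classification quality.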
342. te format the data are interpreted as subsequent values of a date, thus value 1 means 1899-12-31, value 2 means 1900-01-01, and so on. Depending on the selected date format one can also introduce text data in a selected format: 2010 12 31 31 12 2010 12 31 2010 2010 12 31 31 12 2010 12 31 2010 2010 12 31 31 12 2010 12 31 2010. Time: in the case of the time format the data are interpreted as subsequent values of time, and the decimal part of a number means the number of milliseconds from midnight divided by the total number of milliseconds in a day (86,400,000), thus value 0.000694444 means 00:01:00, value 0.041666667 means 01:00:00 and value 0.999988426 means 23:59:59. Depending on the selected time format one can also enter text data in a selected format: 18:31:58, 18:31, 12 31 2010 18:31, 12 31 2010 18:31:58. Numerical: real numbers in this format are in the form of a decimal, and the sign dividing the whole number from the decimal part is a comma or a dot, depending on the settings selected in the Settings window in the field Decimal separator; it is possible to set the number of decimals and the thousands separator. Scientific, i.e. when $M \cdot 10^E$ is used, where the M mantissa is the base and the E index of the power is an integer; as in the numerical format, it is possible to set the number of decimals
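The date and time encodings described above can be reproduced with Python's standard library. This is a sketch under one assumption: the reference day is taken as 1899-12-30, which is the only choice consistent with "value 1 means 1899-12-31". The function names are hypothetical.

```python
from datetime import date, timedelta

EPOCH = date(1899, 12, 30)   # assumed reference day, so that value 1 -> 1899-12-31
MS_PER_DAY = 86_400_000      # milliseconds in a day, as stated in the text

def serial_to_date(value):
    """Whole part of the value counted as days from the reference day."""
    return EPOCH + timedelta(days=int(value))

def fraction_to_time(value):
    """Decimal part of the value converted to HH:MM:SS."""
    ms = round((value % 1) * MS_PER_DAY)   # milliseconds since midnight
    hours, rem = divmod(ms // 1000, 3600)
    minutes, seconds = divmod(rem, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"

print(serial_to_date(1), serial_to_date(2))
print(fraction_to_time(0.000694444), fraction_to_time(0.999988426))
```

Rounding to the nearest millisecond matters here: the stored decimals are truncated, so 0.000694444 corresponds to 59,999.96 ms rather than exactly 60,000 ms.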
343. ted frequencies does not contain any values which are less than 5. The p value = 0.03174. So, at the significance level $\alpha = 0.05$, we can draw the conclusion that there is a dependence between education and country of residence in the analysed population. The Fisher test for R x C tables. The Fisher test for r x c tables is also called the Fisher-Freeman-Halton test (Freeman G.H., Halton J.H. (1951) [31]). This test is an extension of Fisher's exact test to r x c tables. It defines the exact probability of the occurrence of a specific distribution of numbers in the table (when we know n and we set the marginal totals). If you define the marginal sums of each row as $W_i = \sum_{j=1}^{c} O_{ij}$, where $O_{ij}$ are the observed frequencies in a table, and the marginal sums of each column as $K_j = \sum_{i=1}^{r} O_{ij}$, then, having defined the marginal sums, for the different distributions of the observed frequencies represented by $U_{ij}$ you can calculate the $P$ probabilities: $P = D^{-1} \prod_{j=1}^{c} \frac{K_j!}{U_{1j}! \, U_{2j}! \cdots U_{rj}!}$, where $D = \frac{(W_1 + W_2 + \cdots + W_r)!}{W_1! \, W_2! \cdots W_r!}$. The exact significance level $p$ is the sum of the $P$ probabilities (calculated for new values $U_{ij}$) which are smaller than or equal to the $P$ probability of the table with the initial numbers $O_{ij}$. The exact p value is compared with the significance level $\alpha$. The settings window with the Fisher exact test (RxC) can be opened in Statistics menu → NonPa
344. tegories) → Fisher, mid-p (2x2) or in Wizard. [Dialog: Fisher, Mid-P (2x2). Statistical analysis: Fisher exact test, mid-p (2x2). Fields: Variable 1, Variable 2, Data Filter (basic, multiple), Report options (Add analysed data, Add graph).] EXAMPLE 11.7 (cont., sex exam pqs file). Hypotheses: $H_0$: there is no dependence between sex and exam passing in the analysed population, $H_1$: there is a dependence between sex and exam passing in the analysed population. Report: Size: 170; Odds Ratio (exact): 0.268664; 95% CI for the Odds Ratio (exact): 0.130742 to 0.537735; Odds Ratio (mid-p): 0.269805; 95% CI for the Odds Ratio (mid-p): 0.137445 to 0.514591; Fisher: one-sided p value, two-sided p value; Mid-p: one-sided p value. [Graph: contingency table by sex (f, m)] The two-sided p value = 0.000083. So, using the Fisher exact test, similarly to the $\chi^2$ test and the $\chi^2$ test with the Yates' correction, at the significance level $\alpha = 0.05$ you accept the hypothesis that there is a dependence between sex a
345. th the confidence interval, median, standard deviation together with the confidence interval, and information about the skewness and kurtosis of the distribution together with errors. At the top of the window you should see the following message. To add a graph to the report, we select the Add graph option and choose the Box-Whiskers plot type. Confirm your choice by clicking OK and you get the result in a report. Report: Analysed variables: Number of microorganisms; Significance level: 0.05; Group size: 54; Arithmetic mean: 77.240741; Median: 78.5; Standard deviation: 24.425424; 95% Confidence interval for the std. dev.: 20.532603 to 30.153531; 95% Confidence interval for the mean: 70.573883 to 83.907598; Skewness: 0.226875; Std. err. of the skewness: 0.324556; Kurtosis: 0.343163; Std. err. of the kurtosis: 0.638893. [Graph: Box-Whiskers plot of Number of microorganisms] 8 PROBABILITY DISTRIBUTIONS. A real data distribution from a sample (empirical data distribution) may be presented by means of a frequency table, by selecting Statistics menu → Frequency tables. For example, a distribution of the amount of used free minutes by subscribers of some mobile network operator (example 6.1, distribution pqs file) is shown in the following table: Analy
346. th or without header before the conversion of the table into raw data. Then, in the conversion window, the table will be placed automatically. It is also possible to use other labeled tables as a saved selection. 3.1.9 FORMULAS. Defining a formula is a way of calculating data so as to obtain new values for the variables. The window in which we define formulas is accessed by selecting Data → Formulas. [Dialog: Formulas. Option: only from selected rows. Functions: natural logarithm, logarithm of 10, square, square root, factorial, converts from degrees to radians, converts from radians to degrees, sine, cosine, tangent, cotangent, arcsine, arctangent. Output variable: Insert to existing fields, Add new fields. Assign formula to output variable. Select the proper columns and function. Run.] Formulas ascribed to a given variable of the datasheet, as the format of that variable, are remembered together with the datasheet. Their result is automatically recalculated when any of the entry data are changed. The formula can be ascribed in the Formulas window or by selecting Column format (Ctrl+F10). Building formulas: We write formulas in the edition field. We enter the variables to which the formulas refer by giving their numbers, e.g. v1, v2. Text values are entered with the use of an apostrophe, e g
347. than the prevalence coefficient for the population. Because both the positive and negative predictive value depend on the prevalence coefficient, when the coefficient for the population is known a priori we can use it to compute, for each cut-off $x_{cat}$, corrected predictive values according to Bayes's formulas: $PPV_{revised} = \frac{Sensitivity \cdot P_{apriori}}{Sensitivity \cdot P_{apriori} + (1 - Specificity)(1 - P_{apriori})}$, $NPV_{revised} = \frac{Specificity \cdot (1 - P_{apriori})}{Specificity \cdot (1 - P_{apriori}) + (1 - Sensitivity) \cdot P_{apriori}}$, where $P_{apriori}$ is the prevalence coefficient put in by the user, the so-called pre-test probability of disease. [Table: one row per cut-off $x_{cat}$, with columns sensitivity, specificity, PPV, NPV, LR+, LR-, Acc, $PPV_{rev}$, $NPV_{rev}$] The ROC curve is created on the basis of the calculated values of sensitivity and specificity. On the abscissa axis x = 1 - specificity is placed, and on the ordinate axis y = sensitivity. The points obtained in that manner are linked. The constructed curve, especially the area under the curve, presents the classification quality of the analyzed diagnostic variable. When the ROC curve coincides with the diagonal y = x, then the decision made on the basis of the diagnostic variable is as good as the random distribution of stu
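The Bayes correction of the predictive values can be written out directly. The sketch below follows the formulas in the text; the function names and the numeric inputs (sensitivity 0.9, specificity 0.8, a priori prevalence 0.1) are hypothetical illustration values, not results from the guide.

```python
def revised_ppv(sensitivity, specificity, p_apriori):
    """Positive predictive value corrected for a known prevalence (Bayes)."""
    true_pos = sensitivity * p_apriori
    false_pos = (1 - specificity) * (1 - p_apriori)
    return true_pos / (true_pos + false_pos)

def revised_npv(sensitivity, specificity, p_apriori):
    """Negative predictive value corrected for a known prevalence (Bayes)."""
    true_neg = specificity * (1 - p_apriori)
    false_neg = (1 - sensitivity) * p_apriori
    return true_neg / (true_neg + false_neg)

# Hypothetical diagnostic variable evaluated at one cut-off
ppv = revised_ppv(0.9, 0.8, 0.1)
npv = revised_npv(0.9, 0.8, 0.1)
print(round(ppv, 4), round(npv, 4))
```

With a low pre-test probability the corrected PPV drops well below the sensitivity, which is the effect the note on the prevalence coefficient warns about.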
348. that a person who is looking for a flat comes to a real estate agent and defines the obligatory and optional characteristics of the desired property. The characteristics which the flat must have are: it is a retail property (the subject of separate ownership); it is in district A; it is located in a low block of flats (a maximum of 5 floors); it is not renovated (average standard or sub-standard). The data concerning those flats are gathered in a table where 1 means that the property fulfills the search condition and 0 means that it does not fulfill it. The flats which do not fulfill the search conditions will be excluded from the analysis by deactivating the appropriate rows. We deactivate the rows which do not fulfill any of the conditions in the menu Edition → Activate/Deactivate (filter). [Dialog: Activate/Deactivate, with conditions including "In a low block of flats" and "Not renovated"] The conditions of the deactivation should be connected with an alternative. 11 flats appropriate for the segment (fulfilling all 4 conditions) were found in the search: numbers 10, 12, 17, 35, 88, 101, 105, 122, 130, 132 and 135. Now we will take into account the features which have a great impact on the client's choice but are not decisive: The number of rooms = 3; The floor on which the flat is placed = 0; The age of the building
349. the formula d(x1, x2). Minkowski: The Minkowski distance is defined for parameters p and r equal to each other. It is then a metric. Such a kind of metric allows the control of the process of calculating the similarity by giving values p and r in the formula $d(x_1, x_2) = \left( \sum_{k=1}^{n} |x_{1k} - x_{2k}|^p \right)^{1/r}$. When we increase the r parameter, we increase the weight ascribed to the difference between the objects for every characteristic. When we change the p parameter, we increase or decrease the weight ascribed to less or more distant objects. If r and p are equal to 2, the Minkowski distance comes down to the Euclidean distance. If they are equal to 1, to the city block distance. If the parameters tend to infinity, to the Chebyshev metric. City block (also called the Manhattan or taxicab metric): It is the distance which allows movement only within two perpendicular directions. That kind of distance resembles movement along perpendicular streets (a square street network reminiscent of the grid layout of most streets on the island of Manhattan). The metric is calculated with the formula $d(x_1, x_2) = \sum_{k=1}^{n} |x_{1k} - x_{2k}|$. Chebyshev: The distance between the compared objects is the greatest of the obtained distances for the particular characteristics of those objects: $d(x_1, x_2) = \max_k |x_{1k} - x_{2k}|$. Mahalanobis: The Mahalanobis distance i
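The relation between the Minkowski, city block, Euclidean and Chebyshev distances can be checked numerically. This is a minimal sketch of the formulas above, assuming objects given as equal-length sequences of numbers; the function names and the example points are hypothetical.

```python
def minkowski(x1, x2, p=2.0, r=2.0):
    """Minkowski distance: (sum of |difference|^p) raised to the power 1/r."""
    return sum(abs(a - b) ** p for a, b in zip(x1, x2)) ** (1.0 / r)

def city_block(x1, x2):
    """City block (Manhattan, taxicab) distance: sum of absolute differences."""
    return sum(abs(a - b) for a, b in zip(x1, x2))

def chebyshev(x1, x2):
    """Chebyshev distance: the greatest absolute difference."""
    return max(abs(a - b) for a, b in zip(x1, x2))

# Hypothetical objects described by two characteristics
a, b = (0.0, 0.0), (3.0, 4.0)
euclidean = minkowski(a, b, p=2, r=2)   # p = r = 2
manhattan = minkowski(a, b, p=1, r=1)   # p = r = 1
nearly_chebyshev = minkowski(a, b, p=50, r=50)
print(euclidean, manhattan, chebyshev(a, b))
```

The value for p = r = 50 is already very close to the Chebyshev distance, which illustrates the limiting case mentioned in the text.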
350. the median is done with the use of the first datasheet, called Insert the median. In the Missing data window we set the gross profit as the variable to be filled in and select the value of the median as the method of replacement. Consequently, the missing data will be replaced with the value USD 46,850. We suspect that the profits are greater when famous authors' books (coded as 1) are sold and smaller when they arise from the sale of less known authors' books (coded as 0). We will then calculate the median of the gross profit separately for the famous authors' books and for the less known authors' books. The imputation is made on the datasheet called Insert two medians. We set a filter twice for the variable defining the popularity of an author (variable 7), giving it the values 1 and 0 respectively. The obtained median of the gross profit in the group of the popular authors' books is about USD 51,000, and in the group of the less popular authors' books it is about USD 34,000. The missing data can also be replaced with the use of the regression model. We choose the datasheet Insert from regression and once more select, in the Missing data window, the variable concerning the gross profit as the variable which ought to be filled in, and select the Values predicted from regression as a replacement method. This time there will be m
351. the transplantation is 61 4 5 0 08 0 02 is the death probability for each year from the 9-12 range. The results will be presented on a few graphs. [Graphs: Cumulative survival proportion; Survival probability] The probability of survival decreases with the time passed since the transplantation. We do not, however, observe a sudden plunge of the survival function, i.e. a period of time in which the probability of death would rise dramatically. [Graphs: Probability density; Hazard. Time intervals on the horizontal axis: 0-3, 3-6, 6-9, 9-12, 12-15, 15-18, 18-21] 19.2 KAPLAN-MEIER CURVES. Kaplan-Meier curves allow the evaluation of the survival time without the need to arbitrarily group the observations, as in the case of life tables. The estimator was introduced by Kaplan and Meier (1958) [41]. The window with settings for the Kaplan-Meier curve is accessed via the menu Survival analysis Multi dimensional Models → Kaplan-Meier Analysis. [Dialog: Kaplan-Meier Analysis. Statistical analysis: Kaplan-Meier Analysis. Data Filter: basic, multiple.]
352. the values of the numbers inform about the natural order of the groups. The numbers in the analysis are treated as the $c_1, c_2, \ldots, c_k$ weights. 19.3.3 Survival curves for the strata. Often, when we want to compare the survival times of two or more groups, we should remember about other factors which may have an impact on the result of the comparison. An adjustment (correction) of the analysis by such factors can be useful. For example, when studying rest homes and comparing the length of the stay of people below and above 80 years of age, there was a significant difference in the results. We know, however, that sex has a strong influence on the length of stay and the age of the inhabitants of rest homes. That is why, when attempting to evaluate the impact of age, it would be a good idea to stratify the analysis with respect to sex. Hypotheses for the differences in survival curves: $H_0: S^*_1(t) = S^*_2(t) = \ldots = S^*_k(t)$ for all $t$, $H_1$: not all $S^*_i(t)$ are equal. Hypotheses for the analysis of trends in survival curves: $H_0$: in the studied population there is no trend in the placement of the $S^*_1, S^*_2, \ldots, S^*_k$ curves, $H_1$: in the studied population there is a trend in the placement of the $S^*_1, S^*_2, \ldots, S^*_k$ curves, where $S^*_1(t), S^*_2(t), \ldots, S^*_k(t)$ are the survival curves after the correction by the variable determining the strata. The calculations for test statistics are based on formulas described for the tests not taking into account the strata, with the
353. thesis informing that the trend in proportions $p_1, p_2, \ldots$ does exist. As shown in the contingency table of percentages calculated from the sum of columns, there is a decreasing trend: the more interested in the character's life the group of viewers is, the smaller part of the group of new viewers it is. 11.2.5 The Chi-square test and Fisher test for RxC tables. These tests are based on the data gathered in the form of a contingency table of 2 features (X, Y). One of them has r possible categories $X_1, X_2, \ldots, X_r$ and the other one c categories $Y_1, Y_2, \ldots, Y_c$ (look at the table (11.1)). Basic assumptions: measurement on a nominal scale (alternatively an ordinal or an interval scale); an independent model. The additional assumption for the $\chi^2$ test: large expected frequencies (according to the Cochran interpretation (1952) [20], none of these expected frequencies can be < 1 and no more than 20% of the expected frequencies can be < 5). General hypotheses: $H_0: O_{ij} = E_{ij}$ for all categories, $H_1: O_{ij} \ne E_{ij}$ for at least one category, where $O_{ij}$ are the observed frequencies in a contingency table and $E_{ij}$ are the expected frequencies in a contingency table. Hypotheses in the meaning of independence: $H_0$: there is no dependence between the analysed features of the population (both classifications are statistically independent according to X and Y f
354. tingency coefficient adjusted for the table size: $C_{adj} = \frac{C}{C_{max}}$. The C contingency coefficient is considered statistically significant if the p value calculated on the basis of the $\chi^2$ test (designated for this table) is equal to or less than the significance level $\alpha$. The settings window with the measures of correlation C-Pearson, V-Cramer can be opened in Statistics menu → NonParametric tests (unordered categories) → C-Pearson, V-Cramer (RxC) or in Wizard. [Dialog: C-Pearson, V-Cramer (RxC). Statistical analysis: Measures of the correlation C-Pearson, V-Cramer. Fields: Variable 1, Variable 2, Contingency table or Raw data, Data Filter (basic, multiple), Report options (Add analysed data, Add graph, Add percentages).] EXAMPLE 14.2 (sex exam pqs file). There is a sample of 170 persons (n = 170) who have 2 features analysed (X = sex, Y = passing the exam). Each of these features occurs in 2 categories ($X_1$ = f, $X_2$ = m, $Y_1$ = yes, $Y_2$ = no). Based on the sample, we would like to find out if there is any dependence between sex and passing the exam in the analysed population. The data distribution is presented in a contingency table. [Table: observed frequencies, sex by passing the exam] Analysis time Analysed v
355. tio significance we usually use the designated confidence interval. Then we check if the interval contains the value of 1. The odds ratio together with asymptotic confidence intervals and the odds ratio significance test are calculated by: the Chi-square test, OR/RR (2x2) window; the Mantel-Haenszel OR/RR window (for each table designated by the strata). Exact intervals and the mid-p intervals for the odds ratio are calculated by: the Fisher exact test, mid-p (2x2) window. The relative risk (2 x 2 table). In a cohort study we can designate the risk of occurrence of the analysed phenomenon (because the structure of the phenomenon in the sample should come close to that of the population from which the sample was taken) and calculate the relative risk (RR). The estimated risk of occurrence of the analysed phenomenon is designated by the following formula: $R = \frac{O_{11}}{O_{11} + O_{12}}$. The relative risk is designated by $RR = \frac{O_{11} / (O_{11} + O_{12})}{O_{21} / (O_{21} + O_{22})}$. The test of significance for the RR: This test is used to verify the hypothesis that the risk of occurrence of the analysed phenomenon is the same in the group exposed and the group unexposed to the risk factor. Hypotheses: $H_0: RR = 1$, $H_1: RR \ne 1$. The test statistic is defined by $z = \frac{\ln(RR)}{SE}$, where $SE = \sqrt{\frac{1}{O_{11}} - \frac{1}{O_{11} + O_{12}} + \frac{1}{O_{21}} - \frac{1}{O_{21} + O_{22}}}$ is the standard error of $\ln(RR)$. The test statistic asymptotically (for large sample size) has the normal distribution. The p value designated
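The relative risk, its standard error and the z statistic can be computed in a few lines. The sketch below uses the formulas from the text; the cell counts (50, 40, 20, 60) are an assumption chosen to match the sex/exam example used elsewhere in the guide (n = 170, pass rates 55.56% and 25.00%), and the log-scale confidence interval with the normal quantile 1.959964 is a standard construction assumed here rather than quoted from the guide. The function name is hypothetical.

```python
import math

def relative_risk_test(o11, o12, o21, o22, z_crit=1.959964):
    """Relative risk, SE of ln(RR), z statistic, two-sided p and a 95% CI."""
    rr = (o11 / (o11 + o12)) / (o21 / (o21 + o22))
    se = math.sqrt(1 / o11 - 1 / (o11 + o12) + 1 / o21 - 1 / (o21 + o22))
    z = math.log(rr) / se
    p = math.erfc(abs(z) / math.sqrt(2))          # two-sided normal p value
    ci = (math.exp(math.log(rr) - z_crit * se),   # interval built on the log scale
          math.exp(math.log(rr) + z_crit * se))
    return rr, se, z, p, ci

# ASSUMED counts (rows: women, men; columns: passed, failed)
rr, se, z, p, ci = relative_risk_test(50, 40, 20, 60)
print(round(rr, 6), round(z, 6), round(ci[0], 6), round(ci[1], 6))
```

Because the interval is built on the logarithmic scale, it is not symmetric around RR; checking whether it contains 1 gives the same decision as the z test at the corresponding significance level.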
356. tion 13 03 is presented by the data relating to the amount of used free minutes (example 6.1, distribution pqs file). 8.1 CONTINUOUS PROBABILITY DISTRIBUTIONS. Normal distribution, which is also called the Gaussian distribution or a bell curve, is one of the most important distributions in statistics. It has very interesting mathematical features and occurs very often in nature. It is usually designated with $N(\mu, \sigma)$. A density function is defined by $f(x, \mu, \sigma) = \frac{1}{\sigma \sqrt{2\pi}} \exp\left( -\frac{(x - \mu)^2}{2\sigma^2} \right)$, where $-\infty < x < \infty$, $\mu$ is the expected value of the population (its measure is the mean), and $\sigma$ is the standard deviation. Normal distribution is symmetrical about a line perpendicular to the axis of abscissae going through the points designating the mean, mode and median. Normal distribution with a mean of $\mu = 0$ and $\sigma = 1$ ($N(0, 1)$) is the so-called standardised normal distribution. t-Student distribution: the shape of the t-Student distribution is similar to the standardised normal distribution, but its tails are longer. The higher the number of degrees of freedom (df), the more similar the shape of the t-Student distribution to the normal distribution. A density function is defined by $f(x, df) = \frac{\Gamma\left( \frac{df + 1}{2} \right)}{\sqrt{df \, \pi} \, \Gamma\left( \frac{df}{2} \right)} \left( 1 + \frac{x^2}{df} \right)^{-\frac{df + 1}{2}}$, where $-\infty < x < \infty$, df are the degrees of freedom (sample size decreased by the number of limitations in given calculations), and $\Gamma$ is a Gamma function.
357. tistic: 706.959243; Degrees of freedom: 6; p value: < 0.000001; Kaiser-Meyer-Olkin coefficient (KMO): 0.540077. The p value of Bartlett's statistic supports the hypothesis that there is a significant difference between the obtained correlation matrix and the identity matrix, i.e. that the data are strongly correlated. The obtained KMO coefficient is average and equals 0.54. We consider the indications for conducting a principal component analysis to be sufficient. The first result of that analysis which merits our special attention are the eigenvalues (columns: component number, eigenvalue, % of variance, cumulative eigenvalue, cumulative %): 1: 2.918498, 72.96244, 2.918498, 72.96244; 2: 0.91403, 22.85076, 3.832528, 95.81320; 3: 0.146757, 3.668922, 3.979285, 99.48212; 4: 0.020715, 0.517871. The obtained eigenvalues show that one or even two principal components will describe our data well. The eigenvalue of the first component is 2.92 and the percent of the explained variance is 72.96. The second component explains much less variance, i.e. 22.85%, and its eigenvalue is 0.91. According to the Kaiser criterion, one principal component is enough for an interpretation, as only for the first principal component is the eigenvalue greater than 1. However, looking at the scree plot we can conclude that the decreasing line changes into a horizontal one only at the third principal component. [Graph: scree plot, Eigenvalue against Eigenvalue number (1 to 4)]
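The percentages of explained variance and the Kaiser criterion follow directly from the eigenvalues. In the sketch below the third eigenvalue (0.146757) is an assumption: it is not printed legibly in the excerpt and was taken as the difference of the cumulative eigenvalues 3.979285 and 3.832528. Variable names are hypothetical.

```python
# Eigenvalues from the example; the third value is ASSUMED (see the lead-in)
eigenvalues = [2.918498, 0.91403, 0.146757, 0.020715]

total = sum(eigenvalues)   # equals the number of standardised variables (4)
percent = [100.0 * ev / total for ev in eigenvalues]

# Running total of the explained variance, in percent
cumulative = []
running = 0.0
for share in percent:
    running += share
    cumulative.append(running)

# Kaiser criterion: keep the components whose eigenvalue is greater than 1
kaiser_components = [i + 1 for i, ev in enumerate(eigenvalues) if ev > 1]

for i, (ev, share, cum) in enumerate(zip(eigenvalues, percent, cumulative), start=1):
    print(i, ev, round(share, 5), round(cum, 5))
print("Kaiser criterion keeps components:", kaiser_components)
```

The Kaiser criterion keeps a single component here, while the cumulative percentage shows how much variance two components would cover, which is the trade-off discussed in the text.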
358. tomatic model comparison is done in several steps: step 1: Constructing the model with the use of all variables. step 2: Removing one variable from the model. The removed variable is the one which, from the statistical point of view, contributes the least information to the current model. step 3: A comparison of the full and the reduced model. step 4: Removing another variable from the model. The removed variable is the one which, from the statistical point of view, contributes the least information to the current model. step 5: A comparison of the previous and the newly reduced model. In that way numerous, ever smaller models are created. The last model only contains 1 independent variable. As a result, each model is described with the help of adequacy measures ($R^2_{adj}$, $SE_e$), and the subsequent (neighboring) models are compared by means of the F test. The model which is finally marked as statistically best is the model with the greatest $R^2_{adj}$ and the smallest $SE_e$. However, as none of the statistical methods can give a full answer to the question which of the models is the best, it is the researcher who should choose the model on the basis of the results. EXAMPLE 17.1 (cont., publisher pqs file). To predict the gross profit from book sales, a publisher wants to consider such variables as: production cost, advertising costs, direct promotion
359. tton. As a rule, select the variable sex. At the end confirm all chosen options by clicking OK. As a result you get 2 reports, separately for boys and separately for girls. [Dialog: Descriptive statistics (multiple). Statistical analysis: Descriptive statistics. Measures of central tendency: Arithmetic mean, Geometric mean, Harmonic mean, Median, Mode. Distribution: Skewness, Std. err. of the skewness, Kurtosis, Std. err. of the kurtosis. Measures of variability: Variance, Standard deviation, Coefficient of the variability, Std. err. of the mean, Confidence interval for the mean, Range, Interquartile range. Percentiles: Minimum, Maximum, Lower quartile, Upper quartile, Percentile 10 and percentile 90. Data Filter, variable preview. Report options: Add analysed data, Add graph.] 4.3 MULTIPLE (REPEATED) ANALYSIS. To improve the performance of repeated analyses you can: 1. Use the option of saving the current analysis. The PQStat program saves the recently performed analysis and its settings. To go back to this analysis quickly, just click the button on the toolbar. 2. In the analysis window choose many variables, so that the analysis will be carried out repeatedly. Results of the analyses will be returned in the following reports. 3. Use the multiple f
360. tudied phenomenon greater; then values greater than or equal to the cut-off ($x \ge x_{cat}$) are classified in group (+). Destimulant: the growth of its value makes the odds of occurrence of the studied phenomenon smaller; then values smaller than or equal to the cut-off ($x \le x_{cat}$) are classified in group (+). For each of the $x_{cat}$ cut-offs we define true positive (TP), true negative (TN), false positive (FP) and false negative (FN) values. Stimulant, diagnostic variable against reality (+ / -): $x_i \ge x_{cat}$: TP, FP; $x_i < x_{cat}$: FN, TN. Destimulant, diagnostic variable against reality (+ / -): $x_i \le x_{cat}$: TP, FP; $x_i > x_{cat}$: FN, TN. On the basis of those values each cut-off $x_{cat}$ can be further described by means of sensitivity and specificity, positive predictive values (PPV), negative predictive values (NPV), positive result likelihood ratio (LR+), negative result likelihood ratio (LR-) and accuracy (Acc). Note: The PQStat program computes the prevalence coefficient on the basis of the sample. The computed prevalence coefficient will reflect the occurrence of the studied phenomenon (illness) in the population in the case of screening of a large sample representing the population. If only people with suspected illness are directed to medical examinations, then the computed prevalence coefficient for them can be much higher
361. tures, but there might exist another relation, a non-linear one. Graph 14.1: Graphic interpretation of $r_p$. If one out of the 2 analysed features is constant (it does not matter if the other feature is changed), the features are not dependent on each other. In that situation $r_p$ cannot be calculated. Note: You are not allowed to calculate the correlation coefficient if there are outliers in a sample (they may cause the value and the sign of the coefficient to be completely wrong), if the sample is clearly heterogeneous, or if the analysed relation obviously takes a shape other than linear. The coefficient of determination $r_p^2$ reflects the percentage of the variability of a dependent variable which is explained by the variability of an independent variable. A created model shows a linear relationship $y = \beta x + \alpha$. The $\beta$ and $\alpha$ coefficients of the linear regression equation can be calculated using the formulas $\beta = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$, $\alpha = \bar{y} - \beta \bar{x}$. 14.1.2 The test of significance for the Pearson product-moment correlation coefficient. The test of significance for the Pearson product-moment correlation coefficient is used to verify the hypothesis determining the lack of linear correlation between the analysed features of a population, and it is based on the Pearson's linear correlation coefficient calculated for the sample. The closer to 0 the value of $r_p$ is, the w
362. uares method is defined by $\chi^2 = \sum_{s=1}^{w} v_s (\ln RR_s - \ln RR_{MH})^2$, where $v_s$ is a weight computed from the observed frequencies $O_{ij}^{(s)}$ of stratum $s$. This statistic asymptotically (for large frequencies) has the $\chi^2$ distribution with the number of degrees of freedom calculated using the formula $df = w - 1$. The p-value designated on the basis of the test statistic is compared with the significance level $\alpha$: if $p \le \alpha$, reject $H_0$ and accept $H_1$; if $p > \alpha$, there is no reason to reject $H_0$. 14 CORRELATION [Chart: choice of correlation tests by measurement scale. Interval scale, if the data are normally distributed (Kolmogorov-Smirnov or Lilliefors test): tests for the linear correlation coefficient $r_p$ and the linear regression coefficient $\beta$. Ordinal scale: tests for monotonic correlation coefficients $r_s$ or $\tau$. Nominal scale: $\chi^2$ test and the dedicated $C$, $\phi$, $V$ contingency coefficients, or the test for the $Q$ contingency coefficient.] Correlation coefficients are one of the measures of descriptive statistics which represent the level of correlation (dependence) between 2 or more features (variables). The choice of a particular coefficient depends mainly on the scale on which the measurements were done. Calculation of coefficients is one of the first steps of the correlation analysis. Then the statistical significance of the obtained coefficients may be checked using adequate tests. Note: Note that the dependence
363. ues of the HR estimator but also at the 95% confidence interval for those estimators. The range for Rx in model A is 8.06 wide (10.09 minus 2.03) and is narrower in model B: 6.74 (8.34 minus 1.60). That is why model B gives a more precise HR estimation than model A. In order to make a final decision about which model (A or B) will be better for the evaluation of the effect of treatment (Rx), we will once more perform a comparative analysis of the models in the Cox PH regression, comparing models module. This time the likelihood ratio test yields a significant result ($p < 0.0001$), which is the final confirmation of the superiority of model B. That model has the lowest values of the information criteria (AIC = 148.6, AICc = 149, BIC = 151.4) and high values of goodness of fit ($R^2_{Pseudo} = 0.2309$, $R^2_{Nagelkerke} = 0.7662$, $R^2_{Cox-Snell} = 0.7647$). [Report: analysis time, analysed variables, significance level, grouping variable; for model 1 and for model 2: number of variables in the model, convergence criterion met, −2 Log Likelihood, AIC (Akaike criterion), AICc (corrected Akaike criterion), BIC (Bayesian criterion), Pseudo R2 (McFadden), R2 (Nagelkerke), R2 (Cox-Snell); Chi-square models comp
364. v test correction when the mean value $\mu$ and standard deviation $\sigma$ of the population from which the sample is taken are not known. Basic assumptions: measurement on an interval scale. Hypotheses: $H_0$: the distribution of the analysed feature in the population is the normal distribution; $H_1$: the distribution of the analysed feature in the population is different from the normal one. Based on the data from the sample gathered in a cumulative frequency distribution and the corresponding values of the area under the theoretical curve of the normal distribution, you can calculate the value of the test statistic $D$: $D = \sup_x |F_n(x) - F(x)|$, where $F_n(x)$ is the empirical cumulative distribution function calculated at particular points of the distribution for a sample of $n$ elements, and $F(x)$ is the theoretical cumulative distribution function of the normal distribution. This statistic has the Kolmogorov-Smirnov distribution if you know the arithmetic mean and the standard deviation of the population, or the Lilliefors distribution when the arithmetic mean and the standard deviation are estimated from the sample. The p-value designated on the basis of the test statistic is compared with the significance level $\alpha$: if $p \le \alpha$, reject $H_0$ and accept $H_1$; if $p > \alpha$, there is no reason to reject $H_0$. The settings window with the
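The statistic $D$ can be computed as in the following sketch. It is an illustration under stated assumptions, not PQStat's implementation: the function name `ks_statistic` is hypothetical, the supremum is evaluated at the sample points (just before and at each jump of the empirical distribution function), and only $D$ is returned. The p-value would still have to be read from the Kolmogorov-Smirnov distribution (known $\mu$ and $\sigma$) or the Lilliefors distribution (estimated parameters), which this sketch does not cover.

```python
from statistics import NormalDist, mean, stdev


def ks_statistic(sample, mu=None, sigma=None):
    """Return D = sup |F_n(x) - F(x)| against a normal distribution.

    If mu or sigma is omitted it is estimated from the sample
    (the Lilliefors situation).
    """
    if mu is None:
        mu = mean(sample)
    if sigma is None:
        sigma = stdev(sample)
    theoretical = NormalDist(mu, sigma)
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        f = theoretical.cdf(x)
        # F_n jumps from (i - 1) / n to i / n at x; check both sides of the jump.
        d = max(d, i / n - f, f - (i - 1) / n)
    return d


# Hypothetical sample tested against N(0, 1) with known parameters.
d_known = ks_statistic([-1.0, 1.0], mu=0.0, sigma=1.0)
print(round(d_known, 4))
```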
365. variable with the grades calculated on the basis of this variable is called ranking. All reoccurring values are ascribed the same rank, which is the arithmetic mean of the consecutive natural numbers that would be assigned to these values. These kinds of ranks are called ties. For example, to the variable with the following values: 8 6 5 3 8 6 7 1 9 3 7 2 7 3 7 4 7 3 5 2 7 9 9 8 6 5 7 the following ranks are ascribed: [Table: sorted values of the variable and their ranks.] The value 7.3 is ascribed the tie calculated from the numbers 7 and 8, and the value 8.6 the tie calculated from the numbers 10, 11, 12. 10.2.1 The Kolmogorov-Smirnov test and the Lilliefors test. The Kolmogorov-Smirnov goodness-of-fit test for a single sample (Kolmogorov, 1933) [45] is used to verify the hypothesis that the distribution of the analysed variable (empirical distribution) does not differ significantly from the normal distribution (theoretical distribution). We use it in the situation when the mean value $\mu$ and standard deviation $\sigma$ of the population from which the sample is taken are known. When these parameters of the population are not known but are estimated on the basis of the sample, the Kolmogorov-Smirnov test becomes rather conservative (using this test it is much harder to reject the null hypothesis). In such a situation you should use the Lilliefors test (Lilliefors, 1967, 1969, 1973) [51][52][53]. This is the Kolmogorov-Smirno
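The rule for tied ranks described above (each repeated value gets the mean of the ranks it would otherwise occupy) can be sketched as follows. The function name `tied_ranks` and the four-element example are hypothetical; the guide's own example data are not reproduced here.

```python
def tied_ranks(values):
    """Rank values from 1 upwards; equal values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the block while the next sorted value equals the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = ((pos + 1) + (end + 1)) / 2  # mean of ranks pos+1 .. end+1
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks


# Hypothetical data: the two 7.3 values would take ranks 2 and 3.
ranks = tied_ranks([7.3, 5.2, 7.3, 8.6])
print(ranks)
```

By the same rule, values that would occupy ranks 7 and 8 share the rank 7.5, and values that would occupy ranks 10, 11 and 12 share the rank 11.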
366. vectors representing the characteristics of those objects: $d(x_1, x_2) = 1 - K$, where $K$ is the similarity coefficient, the cosine of the angle between two normalized vectors: $K = \frac{\sum_k x_{1k} x_{2k}}{\sqrt{\sum_k x_{1k}^2 \sum_k x_{2k}^2}}$. The objects are similar if the vectors overlap. In such a case the cosine of the angle (similarity) equals 1 and the distance (difference) equals 0. The objects are different if the vectors are perpendicular. In such a case the cosine of the angle (similarity) equals 0 and the distance (difference) equals 1. • Bray-Curtis: The Bray-Curtis distance (the measure of dissimilarity) ought to be calculated on positive data, as it is not a metric (it does not fulfill the first condition, $d(x_1, x_2) \ge 0$). If there are characteristics which also have negative values, we should transform them in advance, for example by normalization to a range of positive numbers. The advantage of that distance is the fact that for positive arguments it is limited to the $[0, 1]$ range, where 0 means that the compared objects are similar and 1 that they are dissimilar: $d(x_1, x_2) = \frac{\sum_k |x_{1k} - x_{2k}|}{\sum_k (x_{1k} + x_{2k})}$. To calculate the measure of similarity BC we subtract the Bray-Curtis distance from 1: $BC = 1 - d(x_1, x_2)$. • Jaccard: The Jaccard distance (measure of dissimilarity) is calculated for binary variables. Jaccard
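The cosine and Bray-Curtis distances described above can be illustrated with a small sketch. It assumes the standard definitions of both measures; the function names and the example vectors are hypothetical.

```python
from math import sqrt


def cosine_distance(x1, x2):
    """d = 1 - K, where K is the cosine of the angle between x1 and x2."""
    dot = sum(a * b for a, b in zip(x1, x2))
    norm = sqrt(sum(a * a for a in x1) * sum(b * b for b in x2))
    return 1 - dot / norm


def bray_curtis_distance(x1, x2):
    """Sum of absolute differences divided by the sum of all values.

    Intended for positive data, where the result stays within [0, 1].
    """
    numerator = sum(abs(a - b) for a, b in zip(x1, x2))
    denominator = sum(a + b for a, b in zip(x1, x2))
    return numerator / denominator


# Hypothetical vectors.
print(cosine_distance([1, 0], [0, 1]))             # perpendicular vectors
print(cosine_distance([1, 2], [2, 4]))             # overlapping directions
print(1 - bray_curtis_distance([1, 2, 3], [3, 2, 1]))  # BC similarity
```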
367. where $x_i$ are the consecutive values of the variable and $\mu$ is the arithmetic mean of these values, $N$ the population size. Variance is never negative, but it is not expressed in the same units as the measuring results. Standard deviation measures the degree of spread of the measurements around the arithmetic mean: sample standard deviation $sd = \sqrt{sd^2}$, population standard deviation $\sigma = \sqrt{\sigma^2}$. The higher the standard deviation or variance value is, the more diverse the group is in relation to the analysed feature. Note: The sample standard deviation is a kind of approximation (estimator) of the population standard deviation. The population standard deviation value is included in a range which contains the sample standard deviation. This range is called a confidence interval for standard deviation. Coefficient of variation: The coefficient of variation, just like standard deviation, enables you to estimate the homogeneity level of an analysed data collection. It is formulated as $V = \frac{sd}{\bar{x}} \cdot 100\%$, where $sd$ means standard deviation and $\bar{x}$ means arithmetic mean. This is a unitless value. It enables you to compare the diversity of several different datasets of one feature, and also to compare the diversity of several features expressed in different units. It is assumed that if the V coefficient does not exceed 10%, features indicate a statistically insignif
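A brief sketch of the coefficient of variation, assuming the sample standard deviation (denominator $n - 1$) as $sd$; the function name and the data are hypothetical.

```python
from statistics import mean, stdev


def coefficient_of_variation(data):
    """V = sd / mean * 100, expressed in percent (unitless)."""
    return stdev(data) / mean(data) * 100


# Hypothetical measurements of one feature.
v = coefficient_of_variation([10, 12, 8, 10])
print(f"V = {v:.2f}%")
```

Because V is unitless, the same function can be applied to features measured in different units and the results compared directly.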
368. x Hypotheses: $H_0: M = I$, $H_1: M \ne I$, where $M$ is the variance matrix or covariance matrix of the original variables $X_1, X_2, \ldots, X_p$, and $I$ is the identity matrix (1 on the main diagonal, 0 outside of it). The test statistic has the form presented below: $\chi^2 = -\left(n - 1 - \frac{2p + 5}{6}\right) \sum_{i=1}^{p} \ln \lambda_i$, where $p$ is the number of original variables, $n$ the size (the number of cases), $\lambda_i$ the $i$th eigenvalue. That statistic has, asymptotically (for large expected frequencies), the $\chi^2$ distribution with $p(p-1)/2$ degrees of freedom. On the basis of the test statistic the p-value is estimated and then compared with the significance level $\alpha$: if $p \le \alpha$, we reject $H_0$ and accept $H_1$; if $p > \alpha$, there is no reason to reject $H_0$. The Kaiser-Meyer-Olkin coefficient: The coefficient is used to check the degree of correlation of the original variables, i.e. the strength of the evidence testifying to the relevance of conducting a principal component analysis: $KMO = \frac{\sum_{i \ne j} r_{ij}^2}{\sum_{i \ne j} r_{ij}^2 + \sum_{i \ne j} \hat{r}_{ij}^2}$, where $r_{ij}$ is the correlation coefficient between the $i$th and the $j$th variable, and $\hat{r}_{ij}$ is the partial correlation coefficient between the $i$th and the $j$th variable. The value of the Kaiser coefficient belongs to the range $[0, 1]$, where low values testify to the lack of a need to conduct a principal component analysis, and high values are a reason for conducting such an analysis. EX
369. y reaches the minimum (Zweig M.H., 1993) [89]. The optimum cut-off point of the diagnostic variable, selected as described above, will finally be marked on the ROC curve. • Costs graph: presents the calculated values of wrong diagnoses together with their costs. The values are computed according to the formula $cost = cost_{FP} \cdot FP + cost_{FN} \cdot FN$. The point marked on the graph is the minimum of the function presented above. • Sensitivity and specificity intersection graph: allows the localization of the point in which the values of sensitivity and specificity are simultaneously the greatest. The window with settings for ROC analysis is accessed via the menu Statistics → Diagnostic tests → ROC curve. [Settings window: ROC curve analysis, with fields for the diagnostic variable, state variable, method, cut-off point, cost of FN wrong diagnosis, cost of FP wrong diagnosis, a priori prevalence, data filter, and the Add graph option.] EXAMPLE 16.2 (file bacteriemia.pqs): Persistent high fever in an infant or a small child without clearly diagnosed reasons is a premise
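The cost formula above can be turned into a small search for the cheapest cut-off. This is a sketch under stated assumptions: the function name `cheapest_cutoff`, the table of cut-offs with their FP and FN counts, and the unit costs are all hypothetical illustrations, not values from the guide.

```python
def cheapest_cutoff(table, cost_fp, cost_fn):
    """Return (cut-off, cost) minimising cost_fp * FP + cost_fn * FN.

    table: iterable of (cutoff, FP, FN) tuples, one per candidate cut-off.
    """
    costs = [(cost_fp * fp + cost_fn * fn, cutoff) for cutoff, fp, fn in table]
    best_cost, best_cutoff = min(costs)
    return best_cutoff, best_cost


# Hypothetical counts: a higher cut-off trades false positives for false negatives.
table = [(10, 30, 2), (15, 12, 5), (20, 4, 14)]
best = cheapest_cutoff(table, cost_fp=1, cost_fn=5)
print(best)
```

Raising the cost of a false negative relative to a false positive pushes the selected cut-off towards the more sensitive end of the table.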
370. y are made a couple of times for the same objects. When measurements of the given feature are performed on objects which belong to different groups, these groups are called independent (unpaired measurements). Some examples of research in dependent groups: examining the body mass of patients before and after a slimming diet, examining the reaction to a stimulus within the same group of objects but in two different conditions (for example at night and during the day), examining the compatibility of the evaluation of credit capacity calculated by two different banks but for the same group of clients, etc. Some examples of research in independent groups: examining the body mass in a group of healthy patients and ill ones, testing the effectiveness of fertilising with several different kinds of fertilisers, testing gross domestic product (GDP) sizes for several countries, etc. Note 2: A graph which is included in the Wizard window makes the choice of an appropriate statistical test easier. The test statistic of the selected test, calculated according to its formula, is connected with the adequate theoretical distribution. [Graph: density curve of the theoretical distribution with the areas $1 - \alpha$ and $\alpha/2$ marked; horizontal axis: value of the test statistic.] The application calculates the value of the test statistic and also a p-value for this statistic (the part of the area under the curve which corresponds to the value of the test statistic). The p-value enables
371. y correlated with the survival time and weakly correlated with one another. When comparing models with various numbers of independent variables we pay attention to information criteria (AIC, AICc, BIC) and to goodness of fit of the model ($R^2_{Pseudo}$, $R^2_{Nagelkerke}$, $R^2_{Cox-Snell}$). For each model we also calculate the maximum of the likelihood function, which we later compare with the use of the Likelihood Ratio test. Hypotheses: $H_0: L_{FM} = L_{RM}$, $H_1: L_{FM} \ne L_{RM}$, where $L_{FM}$, $L_{RM}$ are the maxima of the likelihood function in the compared models (full and reduced). The test statistic has the form presented below: $\chi^2 = -2\ln(L_{RM}/L_{FM}) = -2\ln L_{RM} - (-2\ln L_{FM})$. The statistic asymptotically (for large sizes) has the $\chi^2$ distribution with $df = k_{FM} - k_{RM}$ degrees of freedom, where $k_{FM}$ and $k_{RM}$ are the numbers of estimated parameters in the compared models. On the basis of the test statistic the p-value is estimated and then compared with $\alpha$: if $p \le \alpha$, we reject $H_0$ and accept $H_1$; if $p > \alpha$, there is no reason to reject $H_0$. We make the decision about which model to choose on the basis of the size of AIC, AICc, BIC and the $R^2$ goodness-of-fit measures, and the result of the Likelihood Ratio test which compares the subsequently created (neighboring) models. If the compared models do not differ significantly, we should select the one with a smaller number of variables. This is because a lack of
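The Likelihood Ratio comparison above can be sketched from the −2 log-likelihood values that a model report lists. The function name `likelihood_ratio_test` and the input numbers are hypothetical. As a simplification, the p-value is computed only for one degree of freedom, where the $\chi^2$ tail has a closed form through the complementary error function; other cases would need a $\chi^2$ distribution routine, which is outside this sketch.

```python
from math import erfc, sqrt


def likelihood_ratio_test(neg2_loglik_reduced, neg2_loglik_full, k_reduced, k_full):
    """Return (chi2, df, p_value) for nested models.

    chi2 = -2 ln L_RM - (-2 ln L_FM), df = k_FM - k_RM.
    p_value is given only for df == 1; otherwise it is None.
    """
    chi2 = max(neg2_loglik_reduced - neg2_loglik_full, 0.0)
    df = k_full - k_reduced
    # For df = 1: P(chi2_1 > x) = erfc(sqrt(x / 2)).
    p_value = erfc(sqrt(chi2 / 2)) if df == 1 else None
    return chi2, df, p_value


# Hypothetical -2 log-likelihoods of a 1-variable and a 2-variable model.
chi2, df, p_value = likelihood_ratio_test(160.0, 150.0, k_reduced=1, k_full=2)
print(chi2, df, p_value)
```

If the resulting p-value exceeds the significance level, the models do not differ significantly and the reduced model would be preferred, in line with the rule stated above.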
372. ymptotically (for large sample sizes) has the $\chi^2$ distribution with the degrees of freedom calculated according to the following formula: $df = k - 1$. The p-value designated on the basis of the test statistic is compared with the significance level $\alpha$: if $p \le \alpha$, reject $H_0$ and accept $H_1$; if $p > \alpha$, there is no reason to reject $H_0$. The settings window with the test of the Kendall's W significance can be opened in the Statistics menu → NonParametric tests (ordered categories) → Kendall's W, or in the Wizard. [Settings window: test of the Kendall's W significance, with the variables Dance couple A to Dance couple F, the data filter, and the report options Add analysed data, More results, Add graph.] EXAMPLE 15.2 (judges.pqs file): In the 6.0 system, dancing pairs are assessed by 9 judges. The judges score, for example, artistic expression. They assess dancing pairs without comparing them with each other and without placing them in a particular podium place (they create a ranking). Let's check if the judges' assessments are concordant. [Table: assessments given by the 9 judges to dance couples A to F.]
373. ysed data, Add graph. EXAMPLE 11.3 (pain.pqs file): A sample of 22 patients suffering from cancer was chosen. They were examined to check the level of felt pain (1-10 scale, where 1 means the lack of pain and 10 means unbearable pain). This examination was repeated after a month of treatment with a new medicine which was supposed to lower the level of felt pain. The following results were obtained (pain before, pain after): (2, 2), (2, 3), (3, 1), (3, 1), (3, 2), (3, 2), (3, 3), (4, 1), (4, 3), (4, 4), (5, 1), (5, 1), (5, 2), (5, 4), (5, 4), (6, 1), (6, 3), (7, 2), (7, 4), (7, 4), (8, 1), (8, 3). Now you want to check if this treatment has any influence on the level of felt pain in the population from which the sample was chosen. Hypotheses: $H_0$: the median of the differences between the level of pain before and after a month of treatment in the analysed population equals 0; $H_1$: the median of the differences between the level of pain before and after a month of treatment in the analysed population is different from 0. [Report: analysis time; analysed variables: pain before, pain after; significance level: 0.05; continuity correction: Yes; size (number of pairs); count of omitted pairs (equal values); median of the difference; sum of negative ranks; sum of positive ranks; t statistic; p-value (exact); Z statistic (adjusted for ties); p va
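The rank sums listed in the report can be reproduced for the 22 pairs of the example with a short sketch. It assumes the usual Wilcoxon signed-rank procedure (pairs with equal values are omitted, absolute differences are ranked with ties sharing the mean rank) and takes each difference as pain before minus pain after. The function name `signed_rank_sums` is hypothetical, and the sketch stops at the rank sums; it does not compute the test statistics or p-values.

```python
def signed_rank_sums(before, after):
    """Return (omitted pairs, sum of negative ranks, sum of positive ranks)."""
    diffs = [b - a for b, a in zip(before, after) if b != a]
    omitted = len(before) - len(diffs)
    ordered = sorted(diffs, key=abs)
    negative = positive = 0.0
    i = 0
    while i < len(ordered):
        j = i
        # Group differences with the same absolute value (ties).
        while j + 1 < len(ordered) and abs(ordered[j + 1]) == abs(ordered[i]):
            j += 1
        rank = ((i + 1) + (j + 1)) / 2  # mean rank shared by the tied group
        for d in ordered[i:j + 1]:
            if d < 0:
                negative += rank
            else:
                positive += rank
        i = j + 1
    return omitted, negative, positive


# The 22 (pain before, pain after) pairs from the example.
before = [2, 2, 3, 3, 3, 3, 3, 4, 4, 4, 5, 5, 5, 5, 5, 6, 6, 7, 7, 7, 8, 8]
after = [2, 3, 1, 1, 2, 2, 3, 1, 3, 4, 1, 1, 2, 4, 4, 1, 3, 2, 4, 4, 1, 3]
result = signed_rank_sums(before, after)
print(result)
```

With this sign convention the data give 3 omitted pairs, a sum of negative ranks of 3.5 and a sum of positive ranks of 186.5; taking the differences as after minus before would swap the two sums.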
