A User Manual for SPSS Analysis

1. Aadne Aasland: A User Manual for SPSS Analysis. CNAS 2008 Survey Data. NIBR, Norwegian Institute for Urban and Regional Research.

Table of Contents
Preface
1 Introduction to the CNAS 2008 survey
2 Types of data analysis
3 Preparing the data for analysis: Exploratory analysis and data cleaning
3.1 Distribution of the [...]
3.2 Cleaning [...]
3.3 [...]
4 Univariate analysis
4.1 The distribution
4.2 Central tendency
4.3 Dispersion
5 Comparing groups: Bivariate analysis
5.1 Bivariate measures of association and significance tests
6 Creating additive indexes
7 Multivariate analysis
7.1 Multiple linear regression
7.2 Logistic regression
8 Presenting your findings: making tables and graphs

Prefac
2. [Screenshot: SPSS Variable View listing variables such as depage and c20ai to c20ei with their types, labels and value labels, together with the Missing Values dialog (options: Discrete missing values; Range plus one optional discrete missing value).]

The same frequency distribution can be illustrated in a graph, as shown below. This type of graph is often referred to as a histogram or bar chart.

[Figure: bar chart of Broad Age Group; categories 00 to 14, 15 to 24, 25 to 39, 40 to 59, 60 and Over.]

SPSS allows for a variety of different types of graphs to present our data. For these simple histograms you simply click on Charts under the Frequencies command and select Bar charts.

[Screenshot: Frequencies: Charts dialog for the variable broadage; chart types None, Bar charts, Pie charts and Histograms (with normal curve); chart values Frequencies or Percentages.]
3. Removal of invalid, impossible or extreme values: Such data may be removed from the dataset and recoded as missing values. Unusual values may be out of range, physically impossible (a person of 149 years), unrealistic (an income of 10,000,000,000 Nepali rupees per month), etc. Outliers might also be marked for exclusion for the purpose of certain analyses.

Labeling missing values: It may be necessary to label each missing value with the reason it is considered missing, in order to guarantee accurate bases for analysis.

The data that you have received should be cleaned, but sometimes we discover certain inconsistencies during data analysis. One should then perform the appropriate cleaning. Serious inconsistencies that are found should be reported to CNAS.

In a survey, missing values correspond to skipped questions or impossible options. The research team should discuss how missing values are to be handled. In some cases missing values might be perfectly normal; e.g. the variable "How many livestock are there with your family", with its different categories (C12a to C120), should only be answered by those households who said in C11 that their families keep livestock. However, in some cases missing values for important variables might exclude a record from certain analyses. Sometimes it is appropriate to place normalized values in place of missing values. We will come back to this when we go through how to compute additiv
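The recoding of impossible values described above can be sketched outside SPSS as follows. This is only an illustration: the function name, the toy list of ages and the upper bound of 110 years are assumptions, not part of the CNAS data or the manual.

```python
def recode_out_of_range(values, low, high):
    """Return a copy of values where entries outside [low, high] become None (missing)."""
    cleaned = []
    for value in values:
        if value is not None and low <= value <= high:
            cleaned.append(value)
        else:
            # Out-of-range or already-missing entries are treated as missing.
            cleaned.append(None)
    return cleaned


# Hypothetical ages; 149 is the physically impossible value from the text.
ages = [34, 149, 7, None, 62]
cleaned_ages = recode_out_of_range(ages, 0, 110)  # 110 is an assumed plausibility limit
print(cleaned_ages)
```

Which limits are plausible is a substantive decision for the research team; the code only applies whatever range is chosen.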
4. approximately 95% of the scores in the sample fall within two standard deviations of the mean; approximately 99% of the scores in the sample fall within three standard deviations of the mean.

This information enables us to compare the performance of an individual on one variable with their performance on another, even when the variables are measured on entirely different scales. We can find the standard deviation using the Frequencies command.

[Screenshot: Frequencies: Statistics dialog with options for Percentile Values, Central Tendency (Mean, Median, Mode), Dispersion (Std. deviation, Variance, Minimum, Maximum, S.E. mean) and Distribution (Skewness, Kurtosis).]

The table below shows the mean, median, mode, minimum, maximum and standard deviation for the age variable. Statisti
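Comparing scores that are measured on different scales is usually done by expressing each score in standard deviation units (a z-score). The sketch below assumes the means and standard deviations reported elsewhere in this manual for age (b4) and for the amenities index (amen_ind); the individual scores of 45 years and 8 amenities are hypothetical.

```python
def z_score(value, mean, std_dev):
    """Distance of a score from the mean, expressed in standard deviations."""
    return (value - mean) / std_dev


# Means and standard deviations as reported in the manual's Statistics tables.
age_z = z_score(45, mean=26.07, std_dev=19.689)        # hypothetical age of 45
index_z = z_score(8, mean=5.6632, std_dev=2.64331)     # hypothetical index score of 8

print(round(age_z, 2), round(index_z, 2))
```

Both scores lie a little less than one standard deviation above their respective means, so they are similarly unusual even though one is measured in years and the other in number of amenities.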
5. g1a2_1 + g1a3_1 + g1a4_1 + g1a5_1 + g1a6_1 + g1a7_1 + g1a8_1 + g1a9_1 + g1a10_1.

However, testing the new scale in a reliability analysis gives a Cronbach's Alpha of 0.796 and shows that the new index would be improved by removing primary school as well.

[Table: Item-Total Statistics for the items g1a2_1 to g1a10_1 (series-mean versions of g1a2 to g1a10); columns Scale Mean if Item Deleted, Scale Variance if Item Deleted, Corrected Item-Total Correlation, Squared Multiple Correlation, Cronbach's Alpha if Item Deleted.]

One should repeat this exercise until one reaches the best possible index. Finally we arrive at an index with only 8 items, but with a very high internal correlation between all the items and a very high Cronbach's Alpha.

Exercise: Compute the index as shown above and find the average score on the index for target and non-target groups in each of the four districts.

Exercise: Create an additive index for ownership of household consumer goods (C20). Find the minimum, maximum and average score for target and non-target groups in each of the four districts.

7 Multivariate analysis

In this section we will go through two types of multivariate analysis, i.e. analyses where we have one dependent variable and more than one independent variable: multiple regression and logistic regression. There are a numbe
6. [Screenshot: SPSS Variable View listing the household goods variables c20di to c20ii (e.g. Electricity, Radio, Television, Telephone, Refrigerator) with value label 0 = No.]

Distributions are usually displayed using percentages. We will come back with some additional hints on presenting the data, e.g. in graphs, in the final section of the paper.

EXERCISE: Use the Frequencies command to find:
the percentage of respondents with different income levels (remember B20);
the percentage of respondents in different age ranges.

4.2 Central tendency

The central tendency of a distribution is an estimate of the centre of a distribution of different values. There are three major types of estimates of central tendency: Mean, Median and Mode.

The mean or average is probably the most commonly used method of describing central tendency. The median is the score found at the exact middle of the set of values. The mode is the most frequently occurring value in the set of scores.

We can get the mean, median and mode by using the Frequencies command in SPSS. The following is an illustration of how to estimate these values for the age variable (B4).
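The three estimates of central tendency defined above can be checked by hand on a small example. The following sketch uses Python's standard statistics module; the five ages are hypothetical and are not taken from the CNAS data.

```python
import statistics

# Hypothetical ages of five respondents.
ages = [10, 10, 21, 35, 60]

mean_age = statistics.mean(ages)      # sum of the values divided by their number
median_age = statistics.median(ages)  # middle value of the sorted list
mode_age = statistics.mode(ages)      # most frequently occurring value

print(mean_age, median_age, mode_age)
```

As in the age distribution of the survey, the mean here is pulled above the median by a few high values, which is one reason to report more than one measure.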
7. The first result shows a Cronbach's Alpha of 0.78. It is above the requirement of 0.70.

[Table: Reliability Statistics; columns Cronbach's Alpha, Cronbach's Alpha Based on Standardized Items, N of Items.]

However, are all items to be included in the index? Let's go to the Item-Total Statistics box.

[Table: Item-Total Statistics for the items g1a1_1 to g1a11_1 (series-mean versions of g1a1 to g1a11); columns Scale Mean if Item Deleted, Scale Variance if Item Deleted, Corrected Item-Total Correlation, Squared Multiple Correlation, Cronbach's Alpha if Item Deleted.]

One can see from the result that by removing two of the items one would get a Cronbach's Alpha that is higher than 0.784. In order to get an index that, to the largest possible extent, measures one concept (access to amenities), we would consider removing g1a1_1 and g1a11_1 (drinking water and electricity) from the index. Conceptually this makes sense, as drinking water and electricity are normally not facilities that are associated with the other types of services that are listed in the index. Instead of the index above we should therefore rather have made an index including only the other items in the list. Since it is an indicator of access to services, we change the name:

COMPUTE serv_ind =
8. [Screenshot, continued: Crosstabs: Statistics options Eta, Risk, McNemar, Cochran's and Mantel-Haenszel statistics.]

However, consult statistics handbooks to be sure that you apply the correct measures and for how to interpret the results. One general guide is the following:

Bivariate Associations: Level of Measurement vs. Measure of Association
Nominal by nominal, ordinal or interval/ratio: Contingency C, Cramér's V, Lambda, Phi, Yule's Q
Ordinal by ordinal: Gamma, Somers' d, Tau-a, Tau-b, Tau-c
Ordinal by interval/ratio: a range of other measures are used here, but we will not discuss them
Interval/ratio by interval/ratio: Pearson's r, i.e. the correlation coefficient

Footnote: From http://salises.mona.uwi.edu/sem1_08_09/SALI6012/Data_Analysis/Data%20Analysis.pdf

6 Creating additive indexes

A concept is usually much richer than any single measure of it. Therefore both reliability and validity may be enhanced by developing a number of measures of the same underlying concept and then combining them into a scale or index. An index can be created simply by adding the values of the individual measures that make it up. For example, in the CNAS survey there is a question (G1) asking about access to facilities. Any person could answer either yes or no for each of the facilities. By adding up the number of positive answers one would presumably get an index of acces
9. between variables and to what extent we are able to draw conclusions from our findings. A precaution would be to require a stronger association and a lower significance level than we would normally do if we had drawn a completely random sample. For example, while confidence intervals are usually set to 95% and significance tests are based upon 5% significance levels, these could be changed to 99% and 1% respectively to compensate for the described imprecision. (Footnote 5: There is software available, also in SPSS, which handles complex sample designs, but such software is not yet available to researchers in the project.) We should also be open with readers of our analysis about the limitations, and for example not argue that we can draw conclusions about the whole country of Nepal.

Let us now go back to the two examples above and look at measures of association between the variables. Which measures are appropriate to use depends on the measurement level: nominal, ordinal or scale (interval/ratio). A research question could, for example, be formulated as follows: Is source of water in the house/yard associated with group belonging (target vs. non-target groups)? Our preliminary finding showed rather large differences between groups in Dhanusa, but not so big differences between groups in Sindhupawlchuk, Surkhet and Banke. It seems district differences are larger than group differences in the districts, with an exception for Dhanusa. We
10. considered both a nominal and an ordinal variable.

When we come to nominal by scale, as is the case with group/district (nominal) and health care expenses (scale), we use other measures of association. Our research question is to find out whether household expenses for health care (D17a) are associated with group affiliation and/or district. Eta is the appropriate measure for this. Go to Compare Means under the Analyze drop-down menu. Click Options and then tick "Anova table and eta" in the window that comes up, then Continue and OK.

[Screenshot: Means dialog with d17a in the Dependent List, and the Means: Options dialog showing the available Cell Statistics and the "Anova table and eta" and "Test for linearity" options.]

The results give an Eta squared of 0.11, which, as shown in the ANOVA T
11. however. Thus, approximately 11 per cent of the variation in terms of availability of services is explained by the independent variables in the model.

The ANOVA table tests the acceptability of the model from a statistical perspective.

ANOVA (b)
Regression: Sum of Squares 1759.612, Mean Square 195.512, F 40.437
Residual: Sum of Squares 13924.738, df 2880, Mean Square 4.835
Total: Sum of Squares 15684.351, df 2889
a. Predictors: (Constant), c32 Household Facilities Compared Intergenerational, d_surkh, janjati, a2 VDC/Municipality, low income (Among the lowest 20% per capita household income), muslim, d_banke, dalit, d_sindhu
b. Dependent Variable: serv_ind

The Regression row displays information about the variation accounted for by our model. The Residual row displays information about the variation that is not accounted for by our model. The regression and residual sums of squares are of different sizes and confirm that about 11 per cent of the variation in amenities level is explained by the model. The significance value of the F statistic is less than 0.05 (or 0.01, which is the significance level we have set due to the sampling imperfections explained in a previous section), which means that the variation explained by the model is unlikely to be due to chance.

Let us proceed to look at the coefficient table.

[Table: Coefficients; columns Unstandardized Coefficients (B, Std. Error), Standardized Coefficients (Beta), Sig., Collinearity Statistics (Tolerance, VIF).] Model 1: (Constant), a2 VDC Munic
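The statement that about 11 per cent of the variation is explained can be verified directly from the sums of squares in the ANOVA table above. The sketch below is plain arithmetic on the reported values; the variable names are chosen here for illustration.

```python
# Values copied from the regression ANOVA table.
ss_regression = 1759.612
ss_residual = 13924.738
ss_total = 15684.351
ms_regression = 195.512
ms_residual = 4.835

# R square: share of the total variation accounted for by the model.
r_square = ss_regression / ss_total

# F statistic: explained variance per degree of freedom relative to unexplained.
f_statistic = ms_regression / ms_residual

print(round(r_square, 3), round(f_statistic, 2))
```

The regression and residual sums of squares add up to the total (apart from rounding in the last digit), and their ratio to the total gives the R square of roughly 0.11 quoted in the text.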
12. in the survey, and application of such weights would complicate the analysis further, it was chosen not to apply such weights. Moreover, the small number of missing households made it unnecessary to apply weights for missing values. For more on the application of weights for household surveys, see for example http://help.pop.psu.edu/help-by-statistical-method/weighting/sampling-weights-literature-review

4 Univariate analysis

Univariate analysis involves an examination across cases of one variable at a time. Usually we concentrate on the following three major characteristics of a single variable:
the distribution
the central tendency
the dispersion

Let us go through all these characteristics for a single variable in our study.

4.1 The distribution

The distribution is a summary of the frequency of individual values or ranges of values for a variable. The simplest distribution would list every value of a variable and the number of respondents who had each value. We can, for example, describe the distribution of respondents in terms of their sex or their educational level. This is done by listing the number or percentage of respondents of each sex or with different educational levels. In these cases the variable has few enough values that we can list each one and summarize how many sample cases had the value. With variables that can have a large number of possible values, for example income (B14), with relatively few p
13. [Screenshot: Frequencies: Statistics dialog with options for Percentile Values (Quartiles, Cut points for equal groups, Percentiles), Central Tendency (Mean, Median, Mode), Dispersion (Std. deviation, Variance, Range, Minimum, Maximum, S.E. mean) and Distribution (Skewness, Kurtosis).]

For a continuous variable such as age, with many values, you usually don't want to display the frequency table, so make sure that "Display frequency tables" is not ticked.

4.3 Dispersion

Dispersion refers to the spread of the values around the central tendency. The standard deviation is the most common, the most accurate and a very detailed estimate of dispersion. The standard deviation can be defined as the square root of the sum of the squared deviations from the mean, divided by the number of scores minus one.

SPSS is capable of calculating the standard deviation for our variables. The standard deviation allows us to reach some conclusions about specific scores in our distribution. Assuming that the distribution of scores is normal or bell-shaped (or close to it), then: approximately 68% of the scores in the sample fall within one standard deviation of the mean;
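The verbal definition of the standard deviation given above translates step by step into a short calculation. The sketch below uses a small set of hypothetical scores and checks the hand-written version against statistics.stdev from Python's standard library, which also divides by the number of scores minus one.

```python
import math
import statistics


def sample_std_dev(scores):
    """Square root of the summed squared deviations from the mean, divided by n - 1."""
    n = len(scores)
    mean = sum(scores) / n
    squared_deviations = sum((x - mean) ** 2 for x in scores)
    return math.sqrt(squared_deviations / (n - 1))


# Hypothetical scores with mean 5.
scores = [2, 4, 4, 4, 5, 5, 7, 9]

by_hand = sample_std_dev(scores)
from_library = statistics.stdev(scores)
print(round(by_hand, 3), round(from_library, 3))
```

For these scores the squared deviations sum to 32, so the standard deviation is the square root of 32/7, about 2.14.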
14. the Cox & Snell R square that adjusts the scale of the statistic to cover the full range from 0 to 1. What constitutes a good R square value varies. These statistics can be suggestive on their own, but they are most useful when comparing competing models for the same data. The model with the largest R squared statistic is best according to this measure. In our case, as seen in the table, the R square varies between 0.11 and 0.15.

The classification table shows the practical results of using the logistic regression model.

[Table: Classification Table; observed vs. predicted Perceived employment opportunity in government (0 Equal opportunity, 1 Less opportunity), with Percentage Correct and Overall Percentage. Note a: The cut value is .500.]

Without knowing the background characteristics of our respondents, if we were to guess their score on the job_opp variable, we would simply guess "less opportunity" for all respondents; this would be the correct answer in 60% of the cases. However, by knowing the background characteristics on the independent variables we improve our guess by about 6 percentage points: as shown by the classification table, the Percentage Correct is now increased to 65.8%.

For each case, the predicted response is Yes if that case's model-predicted probability is greater than the cutoff value specified in the dialogs (in this case, the default of 0.5). Cells on
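The logic of the classification table, predicting 1 whenever the model probability exceeds the cutoff and then counting correct guesses, can be sketched as follows. The observed values and predicted probabilities are hypothetical and much smaller than the survey; only the cutoff of 0.5 is taken from the text.

```python
def classify(probabilities, cutoff=0.5):
    """Predict 1 when the model-predicted probability exceeds the cutoff, else 0."""
    return [1 if p > cutoff else 0 for p in probabilities]


def percentage_correct(observed, predicted):
    """Share of cases, in per cent, where prediction and observation agree."""
    hits = sum(1 for o, p in zip(observed, predicted) if o == p)
    return 100 * hits / len(observed)


# Hypothetical data: 1 = less opportunity, 0 = equal opportunity.
observed = [1, 1, 1, 0, 0]
probabilities = [0.8, 0.6, 0.4, 0.3, 0.2]

predicted = classify(probabilities)
model_pct = percentage_correct(observed, predicted)

# Baseline: guess the most common category (1) for everyone.
baseline_pct = percentage_correct(observed, [1] * len(observed))

print(predicted, model_pct, baseline_pct)
```

The comparison between the baseline and the model percentage is the same comparison the text makes between 60% and 65.8%.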
15. [Screenshot: Analyze menu with the Scale submenu open (Reliability Analysis, Multidimensional Unfolding, Multidimensional Scaling).]

2. Select the 11 new variables in the potential index, tick the boxes as shown below, click Continue, and in the next window click OK.

[Screenshot: Reliability Analysis dialog and Reliability Analysis: Statistics dialog, with sections Descriptives for (Item, Scale, Scale if item deleted), Inter-Item (Correlations, Covariances), Summaries (Means, Variances, Covariances, Correlations) and ANOVA Table (None, F test, Friedman chi-square, Cochran chi-square).]
16. [Screenshot: Transform menu with Replace Missing Values selected; in the Replace Missing Values dialog the method is Series mean.]

You make the index based on these new variables.

An additive index can be created by simply adding up all the values:

COMPUTE amen_ind = g1a1_1 + g1a2_1 + g1a3_1 + g1a4_1 + g1a5_1 + g1a6_1 + g1a7_1 + g1a8_1 + g1a9_1 + g1a10_1 + g1a11_1.

We have now created an index of access to amenities, with a potential score from 0 (no amenities) to 11 (all amenities). Let us look at the central tendency and dispersion of the index.

Statistics: amen_ind
N Valid: 2890
N Missing: 0
Mean: 5.6632
Median: 5.5277
Mode: 4.00
Std. Deviation: 2.64331
Minimum: .00
Maximum: 11.00

We see that the average (mean) score on the index is 5.7. Some households have access to no amenities, while some households have access to all 11. However, to what extent do all of the items included in the amenities index really measure the same con
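The two steps described above, replacing missing answers with the series mean and then adding the items up, can be sketched on a toy example. The three item lists below are hypothetical stand-ins for the eleven G1 items, with 1 = yes, 0 = no and None = missing.

```python
def replace_missing_with_mean(values):
    """Replace None entries with the mean of the observed values (series mean)."""
    observed = [v for v in values if v is not None]
    series_mean = sum(observed) / len(observed)
    return [series_mean if v is None else v for v in values]


def additive_index(item_columns):
    """Sum the items case by case; each column holds one item for all cases."""
    return [sum(case) for case in zip(*item_columns)]


# Hypothetical answers of four households to three yes/no items.
g1a1 = [1, 0, None, 1]
g1a2 = [1, 1, 0, 0]
g1a3 = [0, 0, 1, 1]

filled = [replace_missing_with_mean(column) for column in (g1a1, g1a2, g1a3)]
amen_ind = additive_index(filled)
print(amen_ind)
```

Note that a household with a replaced value gets a non-integer index score (here 2/3 for the missing item), which is why the median of the real index is not a whole number.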
17. able, is a statistically significant result. The derived output indicates a high likelihood that the association between group belonging and health care expenses will be present in the population. Thus it is highly likely that this association is found not only in our sample, but exists in the real world in our four districts combined.

[Table: Report, d17a Health Care expenses by group (last rows)]
8.00 Muslims Banke: Mean 8404.75, N 122, Std. Deviation 20394.401
9.00 Others Banke: Mean 5124.24, N 456, Std. Deviation 8691.282
Total: Mean 7820.03, N 2890, Std. Deviation 18044.9956

ANOVA Table: d17a Health Care * group
Between Groups (Combined): Sum of Squares 1.1E+010, df 8, Mean Square 1343312066.4, F 4.160
Within Groups: Sum of Squares 9.3E+011, df 2880, Mean Square 322908020.1
Total: Sum of Squares 9.4E+011, df 2888

[Table: Measures of Association; Eta and Eta Squared for d17a Health Care * group.]

Exercises: Are there statistically significant district-level differences? Are differences between groups statistically significant in all districts (split file)?

You now have the tools to conduct bivariate analysis for different types of variables. The box in the statistics window shows what types of measurements are appropriate for different types of variables.

[Screenshot: Crosstabs: Statistics dialog with options Chi-square and Correlations; Nominal: Contingency coefficient, Phi and Cramér's V, Lambda, Uncertainty coefficient; Ordinal: Gamma, Somers' d, Kendall's tau-b, Kendall's tau-c; Nominal by Interval; Kappa.]
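Eta squared is the between-groups sum of squares divided by the total sum of squares, so it can be recomputed from the ANOVA table above. The sketch below rebuilds the sums of squares from the mean squares and degrees of freedom, since those are printed with more digits than the sums of squares themselves; the variable names are chosen here for illustration.

```python
import math

# Values copied from the ANOVA table for d17a by group.
ms_between, df_between = 1343312066.4, 8
ms_within, df_within = 322908020.1, 2880

# Sum of squares = mean square times degrees of freedom.
ss_between = ms_between * df_between
ss_within = ms_within * df_within

eta_squared = ss_between / (ss_between + ss_within)
eta = math.sqrt(eta_squared)
f_statistic = ms_between / ms_within

print(round(eta_squared, 3), round(eta, 2), round(f_statistic, 2))
```

With these values the ratio works out to roughly 0.011 and its square root (Eta) to roughly 0.11, while the F statistic reproduces the 4.160 shown in the table.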
18. are not treated here. Nevertheless, one can hardly overestimate the importance of these preparatory phases.

The appropriate methods of data analysis are determined by your data types and variables of interest, the actual distribution of the variables, and the number of cases. In the case of the CNAS data set, these parameters are given for those who wish to analyse the data.

It is important to have an initial understanding of the survey data set that is used for this manual. The CNAS data set was collected in four districts of Nepal: Dhanusa, Sindhupawlchuk, Surkhet and Banke. In each district the aim was to have 600 respondents (but 1,200 in Dhanusa, with two target groups). Of these, 400 were to be selected from the target groups: Tarai Dalits and Yadavs in Dhanusa, Tamangs in Sindhupawlchuk, Hill Dalits in Surkhet and Muslims in Banke. The remaining 200 were to be selected among the non-target groups (general population). In each district a stratification took place whereby 20 research sites were selected. For selection procedures and overall survey methodology, see the CNAS project report.

This manual requires some familiarity with SPSS for Windows. Thus it will not cover the more general procedures in SPSS. There are a number of SPSS courses available for students and researchers to familiarize themselves with the programme, and it is recommended that some basic skills are already developed before getting to work on the CNAS data, which is a r
19. ather complex data file.

When you receive the CNAS data set, the following preparatory work has already taken place:

Data have been entered into a data file in SPSS for Windows, with cases (the respondents) in rows and with variables (based on survey questions) in columns. This is what you find if you look at the data file in Data view. In the Variable view you find all the variables in rows, and some characteristics of each variable (which you are allowed to change) in columns.

Some key variables have been recoded or computed into new variables that were not originally in the questionnaire, based on combining responses from two or more variables or regrouping responses on one variable. The variable and value labels should explain these new variables. For example, age at birth has been recoded into age groups.

Missing values and variable types (see later) have been assigned to all variables where relevant.

Before using the data you should save it as your own working data file in order to preserve the original data. In case you make an error you can then revert to the original data file. It is very often useful to save all the syntax you use for computing new variables; then you can simply run the syntax file again if your working data file

[Footnote] A measure is valid if it actually measures the concept we are attempting to measure. It is reliable if it consistently produces the same result.
[Footnote 2] Forthcoming in the autumn 2009.
20. c For example: We would like to know what factors explain why some people feel they do not have the same opportunities as other people in their community to get access to employment in government jobs. Our dependent variable is perceived employment opportunity (1 = not equal opportunity, 0 = equal opportunity). First we compute a new variable which we call job_opp (job opportunity), for example using this syntax:

recode d7 (2=1) (1=0) (else=copy) into job_opp.
missing values job_opp (3 thru high).
variable labels job_opp 'Perceived employment opportunity in government'.
val lab job_opp 1 'Less opportunity' 0 'Equal opportunity'.
format job_opp (F2.0).
freq job_opp.

The results show that only 4 in 10 of the respondents believe they have equal job opportunity.

[Table: frequency distribution of job_opp, Perceived employment opportunity in government; rows 0 Equal opportunity, 1 Less opportunity, Total, Missing (8, 9, System); columns Frequency, Percent, Valid Percent, Cumulative Percent.]

Then we think of which independent variables to include in the model. Our selection of independent variables should be guided by some assumptions about possible relationships. For an exploratory model, which can be refined at any time, we include the following variables: Ethnicity (eth_new), District (district), Age (b4), Sex (b3), Povert

[Footnote] For a more thorough introduction to logistic regression analysis you should consult a statistics handbook.
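The recode step above maps the answer codes of d7 onto a 0/1 variable and treats all remaining codes as missing. A minimal sketch of the same logic follows; the function name and the example codes 8 and 9 are illustrative assumptions, and None stands in for a value declared missing.

```python
def recode_job_opp(d7):
    """Mimic the recode: 2 -> 1 (less opportunity), 1 -> 0 (equal), 3 and above -> missing."""
    if d7 is None:
        return None
    recoded = {2: 1, 1: 0}.get(d7, d7)  # 'else copy': other codes are kept as they are
    # 'missing values job_opp (3 thru high)': codes of 3 or more count as missing.
    return None if recoded >= 3 else recoded


# Hypothetical answers on d7, including two codes that end up as missing.
answers = [1, 2, 2, 8, 9, 1]
job_opp = [recode_job_opp(a) for a in answers]
print(job_opp)
```

Keeping the recode in a syntax file, as the manual recommends, makes it easy to rerun the same transformation on a fresh copy of the data.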
21. ce of water in the house/yard. However, if we do district-wise analysis, which we should do according to our sample design, we get the following result:

Symmetric Measures, by district (Survey district)
1.00 Dhanusa: Phi .181 (Approx. Sig. .000); Cramér's V .181 (Approx. Sig. .000); N of Valid Cases 1157
2.00 Sindhupawlchuk: Phi .045 (Approx. Sig. .274); Cramér's V .045 (Approx. Sig. .274); N of Valid Cases 578
3.00 Surkhet: Phi .035 (Approx. Sig. .403); Cramér's V .035 (Approx. Sig. .403); N of Valid Cases 578
4.00 Banke: Phi .060 (Approx. Sig. .146); Cramér's V .060 (Approx. Sig. .146); N of Valid Cases 578
a. Not assuming the null hypothesis.
b. Using the asymptotic standard error assuming the null hypothesis.

Only in Dhanusa are there statistically significant differences between target and non-target groups. It seems that differences between districts are more important in explaining variation between groups than differences between target and non-target groups within districts. This is strengthened by the following table, with the association between district and C22:

[Table: Symmetric Measures for district by C22; Nominal by Nominal (Phi, Cramér's V), N of Valid Cases.]

The associations measured by Phi and Cramér's V are almost equally large between district and C22 as between group and C22. Phi and Cramér's V are appropriate to use when we deal with two nominal variables. C22 can be
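Cramér's V is derived from the chi-square statistic of the cross-table, and for a table with two rows and two columns it equals the absolute value of Phi, which is why both measures show the same values above. The sketch below computes it for a hypothetical 2 x 2 table of group by water source; the counts are invented for illustration and are not survey results.

```python
import math


def chi_square(table):
    """Pearson chi-square of a contingency table given as a list of rows of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            statistic += (observed - expected) ** 2 / expected
    return statistic, n


def cramers_v(table):
    """Cramér's V: sqrt(chi-square / (n * (k - 1))), k = smaller of rows and columns."""
    statistic, n = chi_square(table)
    k = min(len(table), len(table[0]))
    return math.sqrt(statistic / (n * (k - 1)))


# Hypothetical counts: rows = target / non-target group, columns = water source yes / no.
table = [[30, 70],
         [50, 50]]
print(round(cramers_v(table), 3))
```

A value close to 0 means almost no association and a value close to 1 a very strong one, so the district-wise values of .035 to .181 all indicate weak associations.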
22. cept? One common way to test this is to make the generally reasonable assumption that the composite index is more valid and reliable than any one of the items that make it up. We can correlate each individual item in the index with the score on the composite index. A low correlation would indicate that a particular item is not closely related to the index. That item could then be dropped and the index recalculated.

We usually also perform reliability analysis for the index as a whole. A commonly used measure of an index's reliability is Cronbach's Alpha (α). This measure is calculated from the number of items making up the index and the average correlation among those items. The higher the value of Alpha, the more reliable the index. The value of Alpha generally ranges from zero to one; however, a negative value is technically possible. A score of at least .70 is generally considered acceptable for creating an index.

The reliability analysis can be performed in SPSS in the following way:

1. In the data window choose Analyze, then Scale, and select Reliability Analysis.

[Screenshot: Analyze menu (Reports, Descriptive Statistics, Tables, Compare Means, General Linear Model, ...).]
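The description above, Alpha as a function of the number of items and their average correlation, corresponds to the standardized form of Cronbach's Alpha. The sketch below implements that formula; the average inter-item correlation of 0.25 is a hypothetical value chosen for illustration, not a result from the CNAS data.

```python
def standardized_alpha(n_items, mean_inter_item_r):
    """Standardized Cronbach's Alpha: k * r / (1 + (k - 1) * r)."""
    k = n_items
    r = mean_inter_item_r
    return (k * r) / (1 + (k - 1) * r)


# With the same average correlation, more items give a higher Alpha.
alpha_11_items = standardized_alpha(11, 0.25)  # hypothetical average correlation
alpha_5_items = standardized_alpha(5, 0.25)

print(round(alpha_11_items, 3), round(alpha_5_items, 3))
```

The formula shows why dropping an item can still raise Alpha: removing a weakly related item lowers the number of items but raises the average correlation among the remaining ones.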
23. common in certain districts than in others. The results after split file by district and weight by weight_d are shown in the following table. (Footnote 4: See previous sections for how to do this.)

[Table: c22 Availability: Source of Water in Home/yard, by district (1.00 Dhanusa, 2.00 Sindhupawlchuk, 3.00 Surkhet, 4.00 Banke); columns Frequency, Percent, Valid Percent, Cumulative Percent.]

It shows distinct district-wise differences. Let us now proceed to see if our target groups are more or less likely to have a source of water than the rest of the population. We can use the Crosstabs command to do this. In the Row(s) field we enter the group variable; in the Column(s) box we enter C22.

[Screenshot: Crosstabs dialog with group in Row(s) and C22 in Column(s).]

We click on Cells and then tick Observed counts and Row percentages to get percentages as well as the observed cases.

[Screenshot: Crosstabs: Cell Display dialog with Counts (Observed, Expected), Percentages (Row, Column, Total) and Residuals.]
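What the Crosstabs command produces with Observed counts and Row percentages ticked can be sketched in a few lines. The example below is unweighted and uses hypothetical answers for five households; the variable names group and c22 follow the text, but the category labels are assumptions.

```python
from collections import Counter


def row_percentages(row_values, column_values):
    """Return {(row, column): percentage of that row's cases} for two parallel lists."""
    cell_counts = Counter(zip(row_values, column_values))
    row_totals = Counter(row_values)
    return {
        (row, column): 100 * count / row_totals[row]
        for (row, column), count in cell_counts.items()
    }


# Hypothetical households: group membership and whether there is water in the house/yard.
group = ["target", "target", "target", "other", "other"]
c22 = ["yes", "no", "no", "yes", "yes"]

percentages = row_percentages(group, c22)
for cell, pct in sorted(percentages.items()):
    print(cell, round(pct, 1))
```

Row percentages answer the question asked in the text: within each group, what share of households has a source of water.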
24. cs: b4 Complete age
N Valid: 18665
N Missing: 0
Mean: 26.07
Median: 21.00
Mode: 10
Std. Deviation: 19.689
Minimum: 0
Maximum: 111

Note the maximum of 111: is it a realistic value in Nepal, or is it an outlier (error) that should be recoded as a missing value?

5 Comparing groups: Bivariate analysis

Much of what we are interested in when analysing the CNAS survey data is to compare groups of the population in terms of their risk of social exclusion for a set of indicators. Key variables for comparison are:
1. Target and non-target groups in each district
2. Districts

In addition we can compare groups based on a large number of variables, such as age, educational level, household size and composition, dependency ratio in household, male or female household head, urban/rural settlement, ethnicity/caste, religious affiliation, income levels, economic status, land ownership, and so on. We can use descriptive statistics to do so. Inferential statistics test hypotheses about the data and may permit us to generalize beyond our data set. Examples include comparing means (averages) for a given measurement between several different groups.

The simplest form of comparing groups is to use the split file command (remember to apply weights) and to obtain frequencies, means, standard deviations, etc. for the four districts separately. Let us first do a frequency distribution to find out if having a source of water in the house/yard is more
25. Before weighting, we had the following distribution of respondents belonging to target and non-target groups in each district:

[Table: target1 Target Population by survey district (1 Dhanusa, 2 Sindhupawlchuk, 3 Surkhet, 4 Banke). Rows per district: 1 Selected Ethnic Group, 2 All Others, Total. Columns: Frequency, Percent, Valid Percent, Cumulative Percent.]

However, after weighting we get the following distribution:

[Table: target1 Target Population by survey district, weighted. Same rows and columns as the unweighted table.]

For explorative purposes, however, we may
26. e

In the winter and spring of 2008, the Centre for Nepal and Asian Studies (CNAS), Tribhuvan University, and Shtrii Shakti (S2), in close collaboration with the Norwegian Institute for Urban and Regional Research, conducted two large-scale household surveys as part of a 3-year project on social inclusion and exclusion in Nepal. The aim of this manual is to demonstrate, step by step, a variety of the techniques that can be effectively applied for data analysis of the complex survey data. There are examples of basic analysis techniques as well as more advanced techniques that enable the researcher to answer complex questions that cannot be answered through simpler forms of analysis. It is our hope that the manual will be useful for students of quantitative methodology in Nepal, and especially those who engage with the topic of inclusion and exclusion.

A training course on quantitative survey analysis was carried out in Kathmandu in November 2008, and much of the manual is based on input before, during and after this course. It is meant to be very practically oriented, with a focus on applied methodology and analysis. The reader should be familiar with basic statistics, or be aided by statistics handbooks during the work with this manual. Also, the manual requires access to a survey data set. We decided to use the CNAS data set, which is the most comprehensive in terms of dimensions of exclusion. This data set can be provided free of charge to enrolled studen
27. e C32: experienced improvement (low = much improvement). Note that groups and districts are converted into dichotomous dummy variables.

First, in the data file, choose Analyze in the scroll-down menu, then select Regression and Linear.

[Screenshot: Analyze menu with Regression, Linear selected]

In the window that appears, select the dependent variable serv_ind and the independent variables. You may wish to run optional analyses, such as checking for collinearity, histograms, etc., but we will not do so here.

[Screenshot: Linear Regression dialog]
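As an illustration of what converting districts into dichotomous dummy variables means, the following sketch builds one 0/1 indicator per district. The district codes and dummy names follow the manual; the function name and the three sample cases are hypothetical.

```python
# Convert a categorical district code into dichotomous (0/1) dummy variables.
# District codes follow the manual (1 Dhanusa, 2 Sindhupawlchuk, 3 Surkhet,
# 4 Banke); the sample cases are made up for illustration.
DUMMY_NAMES = {1: "d_dhan", 2: "d_sindhu", 3: "d_surkh", 4: "d_banke"}


def district_dummies(district_code):
    """Return one 0/1 indicator per district for a single case."""
    return {name: int(code == district_code) for code, name in DUMMY_NAMES.items()}


cases = [1, 4, 2]  # hypothetical district codes for three households
dummies = [district_dummies(code) for code in cases]
for row in dummies:
    print(row)
```

In a regression, one of the dummies is normally left out and serves as the category the others are compared with; the predictor list in the manual's model summary contains d_sindhu, d_surkh and d_banke but not d_dhan.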
28. e under a new name and use it as your new working file. Hint: go to the Variable View of your data file and define the measurement level in the box to the right, under Measure.

[Screenshot: Variable View with the Measure column (Scale, Nominal)]

3.2 Cleaning the data

During the exploratory data analyses we assess the need to clean our data. Data cleaning is extremely important, especially when the data collection method allows inconsistencies. All data cleaning work should be carefully documented and made available in a report. Data cleaning includes, among others, the following
29. e indices below.

3.3 Weights

Since certain target groups make up a larger share of the sample than of the population, we get biased results unless we weight for such discrepancies. Therefore, based on population data in the four selected districts, those groups that are over-represented (Tarai Dalits and Yadavs in Dhanusa, Tamangs in Sindhupawlchuk, Hill Dalits in Surkhet and Muslims in Banke) are given a weight (the variable is called weight_d), so that their proportion in the analysis reflects their proportion in the population. The same goes for all other groups. In order to apply these weights, do the following:

1. When in the Data window, choose Data and Weight Cases, tick Weight cases by, and select weight_d.

[Screenshot: Weight Cases dialog]

However, note that the data are not representative of Nepal as such. To get correct results for each district, one should split the file by district and treat each district separately.

[Screenshot: Split File dialog]
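The effect of weighting can be sketched with a small calculation. All numbers below are invented for illustration and are not CNAS figures; the function name is hypothetical. An over-sampled group receives a weight below 1, so its weighted share drops toward its share in the population.

```python
# Weighted versus unweighted share of a target group in one district.
# All numbers below are invented for illustration; they are not CNAS figures.
sample = [
    # (group, weight): an over-sampled group gets a weight below 1
    ("target", 0.5), ("target", 0.5), ("target", 0.5), ("target", 0.5),
    ("other", 1.5), ("other", 1.5), ("other", 1.5), ("other", 1.5),
]


def share(rows, group, weighted):
    """Percentage of cases in `group`, optionally using the case weights."""
    total = sum(w if weighted else 1 for _, w in rows)
    in_group = sum((w if weighted else 1) for g, w in rows if g == group)
    return 100 * in_group / total


unweighted = share(sample, "target", weighted=False)
weighted = share(sample, "target", weighted=True)
print(unweighted, weighted)
```

In this made-up sample the target group is half of the respondents but only a quarter of the weighted total.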
30. educ is equal to 0.825, which means that the odds of perceiving a lack of job opportunities for a person who has SLC or higher education are 0.825 times the odds for a person who has 1-10 grade schooling, which again are 0.825 times the odds for a person who is literate but without schooling, and so on, all other things being equal. A value higher than 1 increases the odds; a value lower than 1 decreases the odds.

Let us then interpret our findings. According to our model, the following variables contribute to the model:

District: District is clearly the variable most strongly associated with perceived job opportunity. Compared to Banke, people in Sindhupawlchuk and Surkhet have a greater likelihood of perceiving a lack of job opportunities, while the situation in Dhanusa is quite similar to that in Banke.

The score on the consumer goods index is also very highly associated with the dependent variable: the more access to consumer goods, the less likely a person is to perceive a lack of job opportunities.

Perception of a lack of job opportunities increases with increasing age. Education has the opposite effect.

Income, citizenship status and membership in organisations do not contribute much to the model and should possibly be deleted. It is noteworthy that ethnicity, caste or religious belonging (using our division into four major groups) is not decisive for the perception of a lack of job opportunities.

As a further check, we can build a model using backward stepwise meth
31. eople having each value, we group the raw scores into categories according to ranges of values (you need to know how to recode variables to do this; if you don't, you can find it in a manual on SPSS).

One of the most common ways to describe a single variable is to make a frequency distribution. Depending on the particular variable, all of the data values may be represented, or you may group the values into categories first. For variables such as age (B4), income (B14) and total working days (B16), it is not sensible to determine the frequencies for each value. Rather, the values are grouped into ranges and the frequencies determined for each range of values. Frequency distributions can be depicted in two ways: as a table or as a graph. The table below shows an age frequency distribution with five categories of defined age ranges, based on variable B4.

[SPSS output: Frequencies, broadage Broad Age Group. N Valid 18665, Missing 0. Rows: 1 00 to 14, 2 15 to 24, 3 25 to 39, 4 40 to 59, 5 60 and Over, Total; Missing: 0 Age Not Reported; Total. Columns: Frequency, Percent, Valid Percent, Cumulative Percent.]

Note that those who have not reported their age are defined as a missing value. This is done in the Variable View of the data window in SPSS.

[Screenshot: Missing Values dialog in Variable View]
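The grouping of ages into ranges can be sketched as follows. The five labels are those of the broadage variable in the manual; the cut points are read off those labels, while the function names, the use of None for an unreported age, and the sample ages are assumptions made for illustration.

```python
from bisect import bisect_right

# Lower bounds of the second to fifth broad age groups used in the manual.
CUT_POINTS = [15, 25, 40, 60]
LABELS = ["00 to 14", "15 to 24", "25 to 39", "40 to 59", "60 and Over"]


def broad_age(age):
    """Map an age in years (variable B4) to its broad age group label."""
    if age is None:
        return "Age Not Reported"  # treated as a missing value in the manual
    return LABELS[bisect_right(CUT_POINTS, age)]


def frequency_table(ages):
    """Count how many cases fall into each broad age group."""
    counts = {}
    for age in ages:
        label = broad_age(age)
        counts[label] = counts.get(label, 0) + 1
    return counts


ages = [0, 14, 15, 21, 39, 40, 60, 111, None]  # hypothetical ages
table = frequency_table(ages)
print(table)
```

Checking the boundary values (14 and 15, 39 and 40, and so on) is a quick way to confirm that a recode puts each age in the intended group.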
32. er is the extent to which they are able to use them. Statistically significant variables, however, are urban/rural residence (people in urban areas have significantly better access) and household facilities compared with the past (those who have experienced improvements have better availability of services). Both of these findings are plausible. More interesting, however, is the impact of district. Compared to people in Dhanusa (the reference group), people in Sindhupawlchuk and Surkhet have on average more services available, while people in Banke have fewer, and the results are statistically significant. Finally, people with low income tend to report lower availability of services, but the significance level is on the margin (we have defined it as 0.01), and in this case the relationship is not statistically significant.

When the tolerances are close to 0, there is high multicollinearity and the standard errors of the regression coefficients will be inflated. A variance inflation factor (VIF) greater than 2 is usually considered problematic, and the highest VIF in the table is 1.411. Thus, in this model we do not seem to have a problem of multicollinearity.

7.2 Logistic regression

While linear regression is useful for dependent variables at interval or ratio scale level, binary logistic regression is most useful when you want to model the event probability for a categorical response variable with two outcomes, typically yes or no, have or have not, et
33. es
3. Multivariate: analysis of three or more variables

In the following, we will start by discussing the main principles of exploratory data analysis. It will be followed by examples of univariate, bivariate and multivariate analysis techniques, involving both descriptive data analysis and inferential statistics.

3 Preparing the data for analysis: Exploratory analysis and data cleaning

The first task, once the data is collected and entered, is to ask: What do the data look like? Exploratory data analysis uses numerical and graphical methods to display important features of the data set. Such exploratory data analysis helps us to highlight general features of the data and thereby direct our further analyses. In addition, exploratory data analysis is used to highlight problem areas in the data. One should particularly ask the following:
- What do the distributions look like for key variables?
- To what extent do the data need cleaning for consistency?
- Should outliers (values that are far from the other values in the distribution) be included or excluded in the analyses?
- Are there many cases and variables with missing data, and how should such missing data be handled?

3.1 Distribution of the data

First we go through the data file and investigate the shape of the data. Where do most of the values lie? Are they clumped around a central value, and if so, are there roughly as many above this value as below it? We look at the dist
34. Select Sort by Statistic, either Ascending or Descending according to your taste, and Apply. After editing some more, your chart will look something like this:

Figure x.x Percentage of households in Banke with different types of household consumer items
[Horizontal bar chart; bars, from top: Bicycle, Electricity, Radio, Television, Telephone, Refrigerator, Motorcycle, Bio-gas Plant, Solar System, Heater Lamp, Tractor/Truck/Bus, Car/Jeep; x-axis: Per cent, 0 to 60]

Additional advice when it comes to making graphs includes the following:

Make different versions of the graph and choose the one that is best suited. For example, should the graph's axis go from 0 or from somewhere else?

If you have continuous variables and wish to present more than averages (income distribution, etc.), it is sometimes useful to make a box plot. In the box plot you can easily display the maximum and minimum values, the middle of the data, the spread of the data (e.g. the 25th and 75th percentiles) and the skewness of the data. See the box plot below for an imagined example.

[Box plot; labelled points: Maximum value, 75th percentile, 50th percentile, 25th percentile, Minimum value]

Be aware of outliers.

Other issues to consider are the use of colours: don't use different colours, but rather shades, for ordinal data; don't use too bright colours, which may cause optical illusions; don't choose colour combinations that are difficult to distinguish; re
35. For different types of methods (stepwise, forward, backward, etc.), consult statistics handbooks. Here we use the default Enter method: all independent variables are entered simultaneously into the model. Let us first look at the model summary.

[Table: Model Summary. Columns: R Square, Adjusted R Square, Std. Error of the Estimate. a. Predictors: (Constant), c32 Household Facilities Compared Intergenerational, d_surkh, janjati, a2 VDC/Municipality, low_income Among the lowest 20% per capita household income, muslim, d_banke, dalit, d_sindhu]

In a multiple linear regression model, adjusted R square measures the proportion of the variation in the dependent variable accounted for by the explanatory variables. Unlike R square, adjusted R square allows for the degrees of freedom associated with the sums of the squares. Adjusted R square is generally considered to be a more accurate goodness-of-fit measure than R square; they are very similar in our case
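The adjustment for degrees of freedom can be written out explicitly. The sketch below uses the standard textbook formula, adjusted R square = 1 - (1 - R square)(n - 1)/(n - k - 1), where n is the number of cases and k the number of predictors. The input values are hypothetical, except that k = 9 matches the number of predictors listed under the model summary; the function name is an assumption.

```python
def adjusted_r_square(r_square, n_cases, n_predictors):
    """Adjusted R square: 1 - (1 - R2) * (n - 1) / (n - k - 1)."""
    if not n_cases > n_predictors + 1:
        raise ValueError("need more cases than predictors plus one")
    return 1 - (1 - r_square) * (n_cases - 1) / (n_cases - n_predictors - 1)


# Hypothetical values: R square of 0.30, 1000 cases, 9 predictors.
example = adjusted_r_square(0.30, 1000, 9)
print(round(example, 4))
```

With many cases and few predictors the correction is small, which is why the two measures end up close to each other in large samples.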
36. hers Banke: Count 94, 456; % within group 20.6, 100.0
Total: Count 112, 578; % within group 19.4, 100.0

We can see rather large differences between groups. The highest shares of those with a source of water in the home/yard are found among Muslims and Others in Banke, then Yadavs and Others in Dhanusa. The lowest percentages are found among respondents in Surkhet, regardless of their group belonging.

Exercise: Find group differences between target and non-target groups in each district in terms of household ownership of land (C1).

Let us say that we are interested in finding the mean amount of Nepali rupees spent on health care in households during the past year, by district and target/non-target group. In the Data window, go to the Analyze menu, select Compare Means and enter as follows:

[Screenshot: Means dialog with Dependent List and Independent List]

You then get the following table, indicating the highest average health care expenses for Yadav households in Dhanusa, followed by Others in Sindhupawlchuk. The lowest are found among Tamangs in Sindhupawlchuk, Hill Dalits in Surkhet and Tarai Dalits in Dhanusa. It is worth noting that Muslims in Banke have no lower average than other groups.

Report: d17a Health Care
1.00 Dhanusa: 1.00 Yadavs Dhanusa 13398.14, 26007.110; 2.00 Tarai Dalits Dhanusa, 3.00 Others Dhanusa: 7752.20, 13128.832; 5645.34, 13385.475; Tota
37. ic group, you use the same procedure. You can combine conditions by writing, e.g., b20 = 2 AND district = 1.

2 Types of data analysis

It is common to differentiate between three different types of data analysis, and we will go through all three in the next chapters.

Exploratory Data Analysis: Exploratory data analysis is used to quickly produce and visualise simple summaries of data sets. We use exploratory data analysis mostly for arranging the data for further analysis.

Descriptive Data Analysis: Descriptive statistics tell us how the data look and what the relationships are between the different variables in the data set. We perform descriptive data analysis to present quantitative descriptions in a manageable form. It should be noted that every time we try to describe a large set of observations with a single indicator, we run the risk of distorting the original data or losing important detail. However, given these limitations, descriptive statistics provide a powerful summary that may enable comparisons across groups of people or other units.

Inferential Statistics: Inferential statistics test hypotheses about the data, which makes it possible to generalize beyond our data set. We will come back to inferential statistics in the section below on comparing groups.

It is also common to differentiate between the following three types of statistical analyses:
1. Univariate: when one variable is analyzed
2. Bivariate: analysis of two variabl
38. ipality, dalit, janjati, muslim, d_sindhu, d_surkh, d_banke, low_income (Among the lowest 20% per capita household income), c32 (Household Facilities Compared Intergenerational). a. Dependent Variable: serv_ind

Standardized coefficients, or beta coefficients, are the estimates resulting from an analysis performed on variables that have been standardized so that they have variances of 1. We want to answer the question of which of the independent variables have a greater effect on the dependent variable, but we know that the variables are measured in different units of measurement. From the table we can see that the beta coefficients are highest for C32 (perceived improvements in household facilities) and A2 (urban/rural type of settlement). To determine the relative importance of the significant predictors, we should therefore look at the standardized rather than the unstandardized coefficients. Even though C32 has a smaller coefficient than d_sindhu and d_banke, C32 contributes more to the model because it has a larger absolute standardized coefficient.

The analysis shows that the group belonging of respondents is not a statistically significant variable in explaining different levels of availability of services in the community when other variables in the model are controlled for. This makes sense, since all people in the village, regardless of their caste, ethnicity or religion, will have services available; another matt
39. l 8566.81 16319.144

District / group                      Mean        Std. Deviation
2.00 Sindhupawlchuk
  4.00 Tamangs Sindhupalchowk         5027.09     12495.352
  5.00 Others Sindhupalchowk          8659.13     21264.489
  Total                               7489.61     18955.244
3.00 Surkhet
  6.00 Hill Dalits Surkhet            5491.75     13221.752
  7.00 Others Surkhet                 8500.25     26371.171
  Total                               7871.47     24241.196
4.00 Banke
  8.00 Muslims Banke                  8404.75     20394.401
  9.00 Others Banke                   6124.24     8691.282
  Total                               6605.43     12150.403

5.1 Bivariate measures of association and significance tests

So far we have given descriptive bivariate statistics. But, as mentioned above, in our research papers we often wish to make inferences from the sampled population to the population as a whole. In the CNAS survey we can do this to some extent, but we should do so with great caution, for the following reasons:
1. We have drawn a sample only from four districts of Nepal.
2. The sample design is complex, while significance tests conducted in SPSS assume simple random sampling.
3. Some groups are over-represented in the survey. This is compensated for by weights, but affects significance tests.
4. The sample is drawn from villages with a certain proportion of both target and non-target ethnic groups, while mono-ethnic environments were not included.

These conditions should not, however, restrict us from conducting significance tests and measuring the strength of association between variables. Even if our results are not completely accurate, they nevertheless give a good indication of the correlation
40. le visual tasks. For nominal variables it makes sense to place the bars in order of size. In this way it is easy to see the order of responses. Also, if labels are long, it is easier to fit them into the graph if the bar chart is turned sideways.

When we have a number of items represented by different variables, one can use the following procedure to get a good graph. We are interested in the percentage of households in Banke with different types of household consumer items (C20). First we select only households in Banke (Select if district = 4). Select Graphs, Legacy Dialogs and Bar.

[Screenshot: Graphs menu, Legacy Dialogs, Bar]

Select Simple (default) and Summaries of separate variables, then Define.

[Screenshot: Bar Charts dialog]

Select C20a to C20k and press Change Statistic.

[Screenshot: Statistic dialog]

Press OK. Now you will get an overview of all the households with ownership of the listed items.

[Bar chart: one bar per household amenity]
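The numbers behind such a chart, and the advice to order bars by size, can be sketched as a small calculation. The households below are made up (1 = owns the item, 0 = does not), the three item names merely stand in for the C20 variables, and the function name is hypothetical.

```python
# Percentage of households owning each consumer item, sorted by size.
# Rows are made-up households (1 = owns the item, 0 = does not); the item
# names stand in for variables C20a to C20k.
households = [
    {"Bicycle": 1, "Radio": 1, "Television": 0},
    {"Bicycle": 1, "Radio": 0, "Television": 0},
    {"Bicycle": 1, "Radio": 1, "Television": 1},
    {"Bicycle": 0, "Radio": 0, "Television": 0},
]


def ownership_percentages(rows):
    """Return (item, per cent owning) pairs, largest percentage first."""
    items = rows[0].keys()
    pct = {item: 100 * sum(r[item] for r in rows) / len(rows) for item in items}
    return sorted(pct.items(), key=lambda pair: pair[1], reverse=True)


ranking = ownership_percentages(households)
for item, pct in ranking:
    print(item, pct)
```

Because the items are coded 0/1, the mean of each variable multiplied by 100 is the percentage of households owning that item.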
41. member that many people are colour-blind; and the use of symbols: symbols require the use of a legend, which may be distracting; more than four symbols tend to overload short-term memory; certain symbols, e.g. circles and squares, are easily confused, especially if they are small.
42. nds on the purpose of the chart. Bar charts are usually better if the purpose is to compare individual pieces to each other. Pie charts, on the other hand, are usually better when we wish to compare pieces to the whole.

Figure x.x Percentage of respondents in Dhanusa with different frequency patterns of listening to radio (n = 1157)
[Pie chart; legend: All the time, Often, Sometimes, Rarely, Never]

The pie chart is good if we want to see how common the different categories are compared to the total. A bar chart would give the following result:

Figure x.x Percentage of respondents in Dhanusa with different frequency patterns of listening to the radio (n = 1157)
[Bar chart; categories: All the time, Often, Sometimes, Rarely, Not at all; y-axis: Per cent]

The bar chart is good if you want to see whether more respondents answer, e.g., all the time compared to often, especially if you don't want to use the labels, as in the figures below.

[Bar chart and pie chart without value labels; same categories as above]

Also, it is recommended to keep the graph simple and avoid three-dimensional and other very fancy graphs, as they tend to be distracting and more difficult to interpret. A good graph relies on simp
43. nly in cases where B20 (Survey status) = 2 (Selected respondent), where all the respondent input is recorded. This is also the case if you wish to carry out analysis at the household level. You do this by opting for Select Cases under Data in the scroll-down menu and ticking If condition is satisfied.

[Screenshot: Select Cases dialog]

Click If under If condition is satisfied.

[Screenshot: Select Cases: If dialog]

In the empty window, write b20 = 2 and click Continue. When the first window comes back, click OK. For the subsequent analysis, you will only analyse cases for respondents or households. If you wish to do analysis only for one district or only for one ethn
44. We can also click on Statistics, but we will come back to this later. The results we get are the following:

group * c22 Availability: Source of Water in Home/yard Crosstabulation

District / group                                        Yes      No       Total
1.00 Dhanusa
  1.00 Yadavs Dhanusa          Count                    123      81       204
                               % within group           60.3     39.7     100.0
  2.00 Tarai Dalits Dhanusa    Count                    29       70       99
                               % within group           29.3     70.7     100.0
  3.00 Others Dhanusa          Count                    524      330      854
                               % within group           61.4     38.6     100.0
  Total                        Count                    676      481      1157
                               % within group           58.4     41.6     100.0
2.00 Sindhupawlchuk
  4.00 Tamangs Sindhupalchowk  Count                    56       130      186
                               % within group           30.1     69.9     100.0
  5.00 Others Sindhupalchowk   Count                    136      256      392
                               % within group           34.7     65.3     100.0
  Total                        Count                    192      386      578
                               % within group           33.2     66.8     100.0
3.00 Surkhet
  6.00 Hill Dalits Surkhet     Count                    22       99       121
                               % within group           18.2     81.8     100.0
  7.00 Others Surkhet          Count                    99       358      457
                               % within group           21.7     78.3     100.0
  Total                        Count                    121      457      578
                               % within group           20.9     79.1     100.0
4.00 Banke
  8.00 Muslims Banke           Count                    104      18       122
                               % within group           85.2     14.8     100.0
  9.00 Ot
45. ods. Backward methods start with a model that includes all of the predictors. At each step, the predictor that contributes the least is removed from the model, until all of the predictors in the model are significant. If the two methods choose the same variables, one can be fairly confident that it's a good model.

8 Presenting your findings: making tables and graphs

How to visualize your findings depends on the purpose of your report or presentation. For an academic audience used to reading tables, tables might be the preferred way to present your results. However, in oral presentations with PowerPoint, policy briefs and papers targeted at a broader audience, a graph is very often easier to interpret and provides an immediate visual impression of the results.

Here we will only make a few comments on the use of tables:
1. For survey results based on a random selection of respondents and with considerable standard errors, it does not make sense to use decimals when presenting percentages of responses. Decimals are slower to read and indicate a greater accuracy than is actually the case.
2. It often makes sense to sort the rows so that the larger numbers stay at the top, unless there are good reasons for not doing so.
3. Usually we put comparisons of interest vertically.
4. Use a smaller font than you would normally use in the text.
5. Be sure to make a title explaining the table and give enough additional explanation so that it is not neces
46. r of other multivariate analysis techniques, but we have selected two very commonly used techniques for different types of dependent variables, and suggest that you master these two before you proceed to more advanced techniques.

7.1 Multiple linear regression

The aim of regression analysis is to estimate the effect or impact of a given independent variable on variation in the dependent variable. In the case of multiple regression, we control for all the other independent variables in the model. We have already made an index for accessibility of services in the community. We would like to see to what extent this level is affected by district, group affiliation, rural/urban settlement, household poverty and experienced improvements in facility level. We use multiple linear regression to calculate how much the dependent variable (service level) changes when other (independent) variables change.

Here we assume some previous knowledge of multiple linear regression. If you are not familiar with regression analysis, you should first consult a statistics textbook. Our aim is to show you how to perform such analysis in SPSS for Windows with the CNAS data set.

The dependent variable is serv_ind (service index). Independent variables are: A2 (high = urban, low = rural); Group: caste (all caste groups), dalit, janjati and muslim; District: d_dhan, d_sindhu, d_surkh, d_banke; Poverty: low_income (among the 20% of households with lowest incom
47. ribution for each variable to determine which analyses would be most appropriate. Types of analyses are also determined by the types of the variables: nominal, ordinal or scale levels. In SPSS you can specify the level of measurement as scale (numeric data on an interval or ratio scale), ordinal, or nominal.

A variable can be defined as nominal when its values represent categories with no intrinsic ranking. Examples of nominal variables in our data set include VDC/municipality (A2), sex (B3), ethnicity (B6) and religious affiliation (B7).

A variable can be defined as ordinal when the values represent categories with some intrinsic ranking, for example levels of satisfaction from highly dissatisfied to highly satisfied. Examples of ordinal variables in the data set include attitude scores, such as comparing the income situation today with that of 25 years ago (highly improved, somewhat improved, etc.; D15) and how proud the respondent is to be a person of his/her caste or ethnicity (very proud, somewhat proud, etc.; O15).

A variable can be defined as scale when the values represent ordered categories with a meaningful metric, so that distance comparisons between values are appropriate. Examples of scale variables from the survey include age in years (B4) and income in Nepali rupees (B14).

Exercise: Go through the data file and check the variables. Define them according to their measurement level: nominal, ordinal or scale. Save the fil
48. s to facilities, which is better than any single item. How do we do this in practice? First we take a look at the distribution of responses. Remember that Select Cases with b20 = 2 should be selected. The responses are 1 = yes, 2 = no, 8 = do not know, and missing. First we recode so that no = 0 and don't know is defined as a missing value. The syntax for doing this is:

RECODE g1a1 g1a2 g1a3 g1a4 g1a5 g1a6 g1a7 g1a8 g1a9 g1a10 g1a11 (2=0).
EXECUTE.
VALUE LABELS g1a1 g1a2 g1a3 g1a4 g1a5 g1a6 g1a7 g1a8 g1a9 g1a10 g1a11 1 'Yes' 0 'No' 8 'Do not know'.
MISSING VALUES g1a1 g1a2 g1a3 g1a4 g1a5 g1a6 g1a7 g1a8 g1a9 g1a10 g1a11 (8).

We cannot assume that all those with missing values lack access. We have two options: either exclude them from the analysis (which means that if a respondent for some reason has a missing value for only one of the 11 items, he or she will be excluded from this index), or create new variables where the missing values and the don't know answers are ascribed the average value of all the other responses. In the following example we have ascribed the average value to missing cases, so that they will be included in other analyses.

[Screenshot: Transform menu (Compute Variable, Count Values within Cases, Recode into Same Variables, Recode into Different Variables, Automatic Recode, Visual Binning)]
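The whole sequence (recode, replace missing values with the item average, sum into an index) can be sketched on a toy example. This is an illustration of the logic only, not the manual's SPSS procedure: the responses are invented, three items stand in for the eleven g1a items, the function names are hypothetical, and the sketch assumes every item has at least one valid answer.

```python
# Build an additive index from yes/no items coded 1 = yes, 2 = no,
# 8 = do not know. Responses below are invented for illustration.
def recode(value):
    """Recode 2 (no) to 0 and treat 8 (do not know) or None as missing."""
    if value == 1:
        return 1
    if value == 2:
        return 0
    return None


def impute_item_means(rows):
    """Replace missing values with the mean of the valid answers to that item."""
    n_items = len(rows[0])
    means = []
    for j in range(n_items):
        valid = [row[j] for row in rows if row[j] is not None]
        means.append(sum(valid) / len(valid))
    return [[means[j] if row[j] is None else row[j] for j in range(n_items)]
            for row in rows]


raw = [
    [1, 2, 1],     # three hypothetical items per respondent
    [1, 8, 2],
    [2, 2, None],
    [1, 1, 1],
]
recoded = [[recode(v) for v in row] for row in raw]
completed = impute_item_means(recoded)
index = [sum(row) for row in completed]
print(index)
```

Respondents with an imputed item get a fractional index score, which keeps them in later analyses at the cost of treating an unknown answer as the average one.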
49. sary to read the text to understand the table.

Let's give an example. We are interested in how often people in the four districts listen to the radio. The SPSS raw output gives a table like this:

h2 Listen to Radio * district Survey district Crosstabulation

                                        1.00      2.00             3.00      4.00
                                        Dhanusa   Sindhupawlchuk   Surkhet   Banke    Total
1 All the time   Count                  157       56               69        58       340
                 % within district      13.6      9.7              11.9      10.0     11.8
2 Mostly         Count                  122       124              120       50       416
                 % within district      10.5      21.5             20.8      8.7      14.4
3 Sometimes      Count                  14        7                32        1        54
                 % within district      1.2       1.2              5.5       .2       1.9
4 Rarely         Count                  215       185              123       156      679
                 % within district      18.6      32.0             21.3      27.0     23.5
5 Not at all     Count                  649       206              234       313      1402
                 % within district      56.1      35.6             40.5      54.2     48.5
Total            Count                  1157      578              578       578      2891
                 % within district      100.0     100.0            100.0     100.0    100.0

This can be made into a table like this:

Table x.x Frequency of listening to radio, by district. Percentage of randomly selected respondents (n = 2891)

                Dhanusa   Sindhupawlchuk   Surkhet   Banke
Never           56        36               41        54
Rarely          19        32               21        27
Sometimes       1         1                6         0
Often           11        22               21        9
All the time    14        10               12        10
n               1157      578              578       578

When making graphs for univariate distributions, is it better to use a pie chart or a bar chart? The answer is that this depe
50. suddenly contains errors that you are not able to remove. You do this by saving the data with a new name that is easy to identify, e.g. Save as CNAS_aaal.sav. You can save as many data files as you wish, but of course they take up some space on your hard drive. You can also put the date in the name of the data file so that it is easy to see when it was created, e.g. CNAS_220909.sav.

You will need a CNAS survey questionnaire to analyse the data, so that you can see the wording of each question. The variable names usually reflect the code for each variable in the questionnaire. The questionnaire contains sections from A to S, in addition to some administrative variables, most of which you find at the beginning of the data file. The data are normally sorted alphabetically, but you can also sort them according to when they appeared in the data file.

The CNAS survey data enable three types of analysis:
1. Analysis on all household members (mostly from section B)
2. Analysis on the household as such (A section, most of C section, much of D section, etc.)
3. Analysis on one randomly selected individual in the household (most of the remaining sections)

It is very important to note that the data file contains data on each individual in the household. Thus, as it is, it is mostly suited for analysis in section B. If you wish to carry out analysis on the randomly selected individual (the respondent), you should do analysis o
51. the diagonal are correct predictions (413 and 1167). Cells off the diagonal are incorrect predictions (276 and 546). The predictors and coefficient values are used by the procedure to make predictions. The table summarizes the effect of each predictor.

Variables in the Equation

Step 1(a)      Wald      df   Exp(B)
eth_new        6.059     3
eth_new(1)     0.192     1    1.098
eth_new(2)     3.619     1    1.577
eth_new(3)     0.830     1    1.241
district       122.307   3
district(1)    0.039     1    1.027
district(2)    54.827    1    0.312
district(3)    35.851    1    0.385
b4             37.780    1    0.980
b3             4.794     1    0.800
low_income     1.860     1    1.195
educ           13.203    1    0.825
member         2.645     1    1.216
am_ind_1       47.311    1    0.809
hhfem          8.148     1    1.801
r1             0.060     1    0.971
Constant       45.702    1    13.123

a. Variable(s) entered on step 1: eth_new, district, b4, b3, low_income, educ, member, am_ind_1, hhfem, r1.

The squared ratio of the coefficient to its standard error equals the Wald statistic. If the significance level of the Wald statistic is small (normally less than 0.05, but in our case it has been set to 0.01 due to sampling imperfections), then the parameter is considered useful to the model. The meaning of a logistic regression coefficient is not as straightforward as that of a linear regression coefficient. While B is convenient for testing the usefulness of predictors, Exp(B) is easier to interpret. Exp(B) represents the ratio change in the odds of the event of interest for a one-unit change in the predictor. For example, Exp(B) for
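The quantities discussed here can be recomputed by hand. The short Python sketch below shows the overall share of correct predictions from the classification counts quoted in the text, and the Wald statistic and Exp(B) for a single coefficient. The coefficient 0.5 and standard error 0.25 are hypothetical values chosen for the illustration, not results from the CNAS model, and the function names are assumptions.

```python
import math

def overall_accuracy(correct, incorrect):
    """Share of cases on the diagonal of the classification table."""
    return sum(correct) / (sum(correct) + sum(incorrect))

def wald_statistic(b, se):
    """Squared ratio of a coefficient to its standard error."""
    return (b / se) ** 2

def odds_ratio(b):
    """Exp(B): multiplicative change in the odds for a one-unit change in the predictor."""
    return math.exp(b)

accuracy = overall_accuracy(correct=[413, 1167], incorrect=[276, 546])

# Hypothetical coefficient and standard error, for illustration only
wald = wald_statistic(b=0.5, se=0.25)
exp_b = odds_ratio(0.5)
```

An Exp(B) above 1 means that the odds of the event increase with the predictor, and a value below 1 means that they decrease.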
52. tor and last. This means that in your results the reference categories will be Muslims and Banke, which are those the other categories will be compared with. Click Continue and OK (there are many more options, but they will not be explained here). Let us first take a look at the Model Summary. It presents two different R square values.

Model Summary
Step | -2 Log likelihood | Cox & Snell R Square | Nagelkerke R Square
a. Estimation terminated at iteration number 4 because parameter estimates changed by less than .001.

In the linear regression model (see above), the coefficient of determination, R square, summarizes the proportion of variance in the dependent variable associated with the predictor (independent) variables, with larger R square values indicating that more of the variation is explained by the model, to a maximum of 1. For regression models with a categorical dependent variable, it is not possible to compute a single R square statistic that has all of the characteristics of R square in the linear regression model, so two approximations are computed instead. The following methods are used to estimate the coefficient of determination. Cox and Snell's R square is based on the log likelihood for the model compared to the log likelihood for a baseline model. However, with categorical outcomes it has a theoretical maximum value of less than 1, even for a perfect model. Nagelkerke's R square is an adjusted version of
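To make the relationship between the two statistics concrete, the sketch below computes both from the -2 log likelihood values using the standard textbook formulas. It is not taken from the manual: the function names and the example numbers are hypothetical, and SPSS reports these values directly in the Model Summary.

```python
import math

def cox_snell_r2(dev_null, dev_model, n):
    """Cox & Snell R square from the -2 log likelihoods of the baseline and fitted models."""
    return 1.0 - math.exp(-(dev_null - dev_model) / n)

def nagelkerke_r2(dev_null, dev_model, n):
    """Cox & Snell R square divided by its theoretical maximum, so the range is 0 to 1."""
    max_r2 = 1.0 - math.exp(-dev_null / n)
    return cox_snell_r2(dev_null, dev_model, n) / max_r2

# Hypothetical values: 100 cases, baseline -2LL of 138.6, model -2LL of 120.0
cs = cox_snell_r2(138.6, 120.0, 100)
nk = nagelkerke_r2(138.6, 120.0, 100)

# A perfect model (-2LL of 0) still gives a Cox & Snell value below 1
cs_perfect = cox_snell_r2(138.6, 0.0, 100)
nk_perfect = nagelkerke_r2(138.6, 0.0, 100)
```

Because of the rescaling, the Nagelkerke value is never smaller than the Cox & Snell value for the same model.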
53. treat the survey population where each district counts the same in the final analysis. It is recommended to always use the weight_d variable if we do not split the analysis on target and non-target group. This has implications for the results. See, for example, results with and without applying weights for the proportion of households with and without television (c20g) in the four districts.

If weights are not applied:

[Table: c20g Amenity: Television, by survey district (1.00 Dhanusa, 2.00 Sindhupawlchuk, 3.00 Surkhet, 4.00 Banke); rows Valid, Missing, Total; columns Frequency, Percent, Valid Percent, Cumulative Percent]

If applying weights:

[Table: c20g Amenity: Television, by survey district (1.00 Dhanusa, 2.00 Sindhupawlchuk, 3.00 Surkhet, 4.00 Banke); rows Valid, Missing, Total; columns Frequency, Percent, Valid Percent, Cumulative Percent]

Exercise: Check differences in other results when applying or not applying weights. How do you interpret the differences in results?

One can also choose to apply weights to correct for differences between analysis of 1) randomly selected individuals and 2) all members of households, as these groups have different probabilities of being selected. However, since household size is not closely connected with key exclusion variables tested
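The effect of such a weight can be illustrated with a small calculation. The sketch below is a simplified assumption about how a weight like weight_d could work (each district scaled to the same weighted size); the actual construction of weight_d is not documented in this excerpt. The sample sizes are borrowed from the radio crosstabulation purely for illustration, and the television shares are hypothetical.

```python
def equal_district_weights(sample_sizes):
    """Weight per case so that every district has the same weighted number of cases."""
    target = sum(sample_sizes.values()) / len(sample_sizes)
    return {district: target / n for district, n in sample_sizes.items()}

sample_sizes = {"Dhanusa": 1157, "Sindhupawlchuk": 578, "Surkhet": 578, "Banke": 578}
# Hypothetical share of households with television in each district
tv_share = {"Dhanusa": 0.20, "Sindhupawlchuk": 0.30, "Surkhet": 0.25, "Banke": 0.35}

weights = equal_district_weights(sample_sizes)
weighted_sizes = {d: sample_sizes[d] * weights[d] for d in sample_sizes}

# Pooled share without weights: larger samples dominate the result
unweighted = sum(sample_sizes[d] * tv_share[d] for d in sample_sizes) / sum(sample_sizes.values())

# Pooled share with weights: every district counts the same
weighted = sum(weighted_sizes[d] * tv_share[d] for d in sample_sizes) / sum(weighted_sizes.values())
```

In this made-up example the district with the largest sample has the lowest share, so the unweighted pooled figure is lower than the weighted one.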
54. ts and researchers by approaching CNAS. We would like to thank all those in CNAS and S2 who have contributed to the two surveys, and the people they have hired to participate in sample design, data collection, data entry and data cleaning. In particular, we wish to thank project coordinator Professor Dilli Ram Dahal of CNAS. Furthermore, Associate Professor Bidhan Acharya, Population Studies, Tribhuvan University, has been in charge of the sampling design used for the CNAS survey and has prepared the data for analysis. We also thank Berit Willumsen for help in preparing the manuscript for publication. Finally, we are very grateful to the Ministry of Foreign Affairs of Norway for its generous financial support.

Oslo, September 2009
Marit Haug, Research Director, Project Leader

1 Introduction to the CNAS 2008 survey data

Data analysis will never provide good results unless the data are of good quality. Therefore, already in the preparation phase of a project, great care needs to be taken to use operational definitions that are valid and reliable measures of concepts. This manual is based on an existing data set from a survey on social exclusion and inclusion in Nepal. Preparations for data analysis start already in the planning phase of a survey, with questionnaire design and procedures for sampling. As this manual is primarily concerned with data analysis techniques, topics such as questionnaire design, sampling and other preparatory work
55. want to test the null hypothesis that there is no difference between groups. For this analysis we have variables at the nominal level, and Phi and Cramér's V are appropriate. We select Crosstabs again, click on the box for Statistics, and then tick the box for Phi and Cramér's V.

[Screenshot: Crosstabs: Statistics dialog with options for Chi-square, Correlations, Contingency coefficient, Phi and Cramér's V (ticked), Lambda, Uncertainty coefficient, Gamma, Somers' d, Kendall's tau-b, Kendall's tau-c, Kappa, Risk, McNemar, and Cochran's and Mantel-Haenszel statistics]

The result is shown below.

Symmetric Measures
Nominal by Nominal: Phi, Cramér's V
N of Valid Cases: 2891
a. Not assuming the null hypothesis.
b. Using the asymptotic standard error assuming the null hypothesis.

This shows statistically significant associations between group belonging and likelihood of having a sour
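For readers who want to see what lies behind the Cramér's V figure, the sketch below computes the chi-square statistic and Cramér's V for a contingency table of counts using the standard formulas. The function names and the small example table are hypothetical and are not CNAS data.

```python
import math

def chi_square(table):
    """Pearson chi-square for a table of observed counts (list of rows)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    return chi2

def cramers_v(table):
    """Cramér's V: 0 means no association, 1 means perfect association."""
    n = sum(sum(row) for row in table)
    smaller_dimension = min(len(table), len(table[0]))
    return math.sqrt(chi_square(table) / (n * (smaller_dimension - 1)))

# Hypothetical 2 x 2 table: rows are two groups, columns are yes / no
example = [[30, 10], [10, 30]]
v = cramers_v(example)
```

For a 2 x 2 table Cramér's V equals the absolute value of Phi. For larger tables V stays within 0 to 1, which is why it is the measure usually reported.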
56. y income among the 20% lowest (low_income), Education (educ), Civil society membership (member), Household consumer goods level (am_ind_1), Female head of household (hhfem), Citizenship (r1). Perhaps you could think of other variables that should be included. In the data window, select Analyze, Regression, and Binary Logistic. Select your dependent variable (job_opp) and your independent variables.

[Screenshot: Logistic Regression dialog with job_opp as Dependent, the independent variables listed under Covariates in Block 1 of 1, and Method set to Enter]

Some of the variables (district, eth_new) are categorical and need to be defined as such. Click the box Categorical and select these two as categorical.

[Screenshot: Logistic Regression: Define Categorical Variables dialog with eth_new (Indicator) and district (Indicator) under Categorical Covariates; Change Contrast set to Indicator; Reference Category options Last and First]

Default is indica
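What the Indicator contrast with the last category as reference does to a categorical variable can be sketched as follows. This is an illustration only: the function name is hypothetical, the district order follows the value codes 1 to 4 used in the data file, and SPSS creates these indicator variables internally.

```python
def indicator_coding(categories, reference="last"):
    """Return one 0/1 indicator per non-reference category; the reference is all zeros."""
    ref = categories[-1] if reference == "last" else categories[0]
    kept = [c for c in categories if c != ref]
    return {c: [1 if c == k else 0 for k in kept] for c in categories}

districts = ["Dhanusa", "Sindhupawlchuk", "Surkhet", "Banke"]
coding = indicator_coding(districts)
# coding["Banke"] is the reference row: every indicator is 0
```

Each indicator corresponds to one of the numbered terms in the output, for example district(1), district(2) and district(3), and its coefficient compares that district with the reference district.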
57. [Figure: bar chart of household amenities (Bicycle, Motorcycle, Car/Jeep, Tractor/Truck/Bus, Electricity, Radio, Television, Telephone, Refrigerator, Bio-gas Plant, Solar System/Heater/Lamp)]

The next steps are a good way to edit the figure. First we want to turn the graph sideways. Double-click the graph and start to edit it within the Chart Editor window.

[Screenshot: Chart Editor toolbar]

Click the symbol indicated in the above figure (Transpose chart coordinate system). This gives the following figure.

[Figure: Chart Editor showing the transposed bar chart, with bars from top to bottom for Solar System/Heater/Lamp, Bio-gas Plant, Refrigerator, Telephone, Television, Radio, Electricity, Tractor/Truck/Bus, Car/Jeep, Motorcycle, Bicycle; cases weighted by weight_d]

Now you can start to edit the chart. First you would like to select the order from high to low. Double-click on the bars. The following Properties window appears.

[Screenshot: Properties window with tabs Size, Fill & Border, Bar Options, Depth & Angle, Variables and Categories]
