Home

MACROS GTIVE Generic Tool for Important Variable

1. 32 7 11 IGV data for 10 stage design igv e3c des 33 7 12 T AXI Features that influence Compressor Pressure Ratio the most a 34 7 13 T AXI Features that influence Compressor Pressure Ratio the most b 34 7 14 Stringer stress analysis Feature scores estimated by GTIVE 35 7 15 Stringer stress analysis Approximationerrorratlo 36 7 16 Fuel System Analysis Features scores and Approximation error ratio 37 DYN ND VAN 0 AN EADS COMPANY Chapter 1 Introduction 1 1 What is GTIVE Generic Tool for Important Variable Extraction GT IVE is a software package for per forming global sensitivity analysis on user provided data In the 13 sensitivity analysis is defined as the study of how the variation uncertainty in the output of a statistical model can be attributed to different variations in the inputs of the model In other words it is a technique for systematically changing variables features in a model to determine the effects of such changes 1 2 Documentation structure Documentation for GTIVE includes e User manual this document which contains A general overview of the tool s functionality Short descriptions of the algorithms Recommendations on the tool s usage Examples of applications to model problems e Technical reference 3 for C and Python API which includes Description of system requirements Install
2. amp hc bi ime bee b Oe RIS Be WS oo 14 3 3 1 Feature scores a oe 4b eos be Be eee eS ee ee Se ees 14 3 3 2 Standard deviation 2 4 624 64 deo OE ASS RPE SE KE GEN 15 4 User configurable options 16 AI RED 28 4 4 34 4 bed ee RE ee ea eG be OE Be ee sika 16 4 2 Mutual Information Kraskovestimate 16 4 3 Mutual Information Histogram based estimate 17 4 4 SMBFAST SurrogateModel BasedFAST 18 4 5 Elementary Effects 2 226 4445 6 sa ee ka ee ee we 19 4 6 Extended FAST Fourier Amplitude Sensitivity Testing 20 5 Limitations 22 6 Selection of technique 24 6 1 Selection of the technique by the user 24 6 2 Default automatic selection 24 il DATADVANCE iii CONTENTS AN EADS COMPANY 7 Usage Examples 27 7 1 Artificial Examples ml mke Be a RO f a BO 27 7 1 1 Example 1 simple function no cross feature interaction 27 7 1 2 Example 2 usage of confidence intervals to determine redundant vari EP e emk m e ee we a ee ee es iy 30 7 1 3 Example 3 difference between main and total scores in FAST 31 7 2 Real world data examples ska rk erik mad seddi 32 Tl T AX problem e ernest 644 eh db DEER Se inekle 32 7 2 2 Stringer Super Stiffener Stress Analysis problem 35 7 2 3 FuelSystemA nalysisproblem 37 References 38 Inde
3. the feature For example x and x2 were defined in kilograms and we want to change the measurement units of z to grams In this case though new values of rescaled x would become 1000 times larger but it s feature score would remain the same 2 4 State of the art methods There are lots of approaches to the problem of global sensitivity analysis 5 13 12 7 14 8 Technique appropriate for each task depends on the problem conditions and user requirements We ve designed the GTIVE tool to include the most effective state of the art methods covering different problem settings In this section brief overview of the techniques used in the GTIVE is provided We may group sensitivity analysis techniques in two big groups o Methods that can work with any sample o Methods that require sample of a particular structure to work Generally the methods of the second group are more precise but due to the sample form requirements one usually needs to have an interface to the considered function to be able to generate required specific sample For each situation different techniques are implemented in the GTIVE and we refer to them as sample based and black box based correspondingly 2 4 1 Sample based techniques These techniques require some data sample X Y given where X X i 1 K Y Y i 1 K components of input vector X i Mp Y f X K is the total number of samples In the GTIVE the following sample b
4. 2 for uniformity GTIVE scores are given after taking the square root As expected since the function is not linear and monotonic the first three techniques gave inaccurate results 2 7 Remark on the selection of techniques for GTIVE The selection of techniques for GTIVE was associated with different factors 1 The need to provide basic modes of operation reliable linear solution on a small sample RidgeFS e medium size sample from 50 to 500 points Mutual Information kraskov large sample from 200 to several hundred thousand points Mutual Information histogram black box with small budget from 2 inputDimension 1 to 2000 Elemen tary Effects 11 DATADVANCE CHAPTER 2 OVERVIEW AN EADS COMPANY Technique X Y Pearson 59 41 Spearman 78 22 RidgeFS 74 26 Mutual Inf kraskov 33 67 Mutual Inf hist 34 66 Elementary Effects 35 65 FAST 34 66 Table 2 2 Pearson s and Spearman s correlation coefficients and GTIVE techniques e black box with large budget from 65 inputDimension to hundreds of thousands FAST 2 The popularity of techniques e RidgeFS is a standard linear estimate e Mutual Information is a widely used technique for feature selection in biology medicine image processing e g see 17 11 10 16 e Elementary Effects is a standard screening technique based on computation of average partial derivatives and
5. 8 40 DATADVANCE AN EADS COMPANY Index Options Accelerator 18 Deterministic 20 21 MinCurveNum 20 NumberOfCVFold 18 NumberOfNeighbors 17 NumberOfSearchCurves 21 RankTransform 17 18 Seed 20 21 SensitivityIndexesType 19 21 SurrogateModelType 19 Technique 24 VarianceEstimateRequired 13 41
6. Friedman The elements of statistical learning data mining inference and prediction Springer 2008 E Kitanin Air Evolution Research in Fuel Systems 4 Technical report IRIAS 2010 I Kononenko An adaptation of relief for attribute estimation in regression 1997 A Kraskov Estimating mutual information Physical review E Statistical nonlinear and soft matter physics 69 40 79 2004 H Liu Relative entropy based probabilistic sensitivity analysis methods for design under uncertainty aiaa 2004 4589 10 th AIAA ISSMO Multidisciplinary Analysis and Optimization Conference 2004 F Maes Multimodality image registration by maximization of mutual information IEEE transactions on Medical Imaging 16 pages 187 198 1998 P Qiu Fast calculation of pairwise mutual information for gene regulatory network reconstruction Computer Methods and Programs in Biomedicine 94 2 177 180 2009 A Saltelli A quantitative model independent method for global sensitivity analysis of model output Technometrics 41 39 56 1999 A Saltelli Global Sensitivity Analysis The Primer Wiley 2008 B Schulkopf Measuring statistical dependence with hilbert schmidt norms In Algo rithmic Learning Theory 3734 63 74 2005 V Schwieger Variance based sensitivity analysis for model evaluation in engineering surveys Data Processing pages 1 10 2004 38 DATADVANCE BIBLIOGRAPHY AN EADS COMPANY 16 H Sundar Robust computat
7. inputs on all design space Let Y f X X RP Y R be some considered dependency f X may be some physical experiment or a solver code Without loss of generality only the case of g 1 will be considered below If g gt 1 the model has many outputs each output is treated 4 DATADVANCE CHAPTER 2 OVERVIEW AN EADS COMPANY independently GTIVE procedure calculates score w for each feature x from a feature set X 21 2 also known as input vector such that higher score reveals more sensitivity higher variations of the output Y with respect to the variations of the corresponding input The scores are positive numbers generally between 0 and 1 higher score indicates that the variable is more important There are several different techniques implemented in the tool the precise meaning of the score is technique dependent For a sensitivity analysis technique we wish it to share the following properties o If one variable is more important than the other in a technique defined way we want it s score to be higher o We want feature scores to be proportional to the corresponding variables influence so that comparing scores one would get the idea of relative importance of variables These properties allow to rank features in the order of importance and give the idea of approximately to what extent one feature is more important than other features 2 2 Quality metrics To compare techniques performance the follo
8. of the original design space Column removals also produce a warning to log As a result we obtain a reduced matrix X Y consisting of the submatrices X and Y Accordingly we define effective input dimension p as the number of columns in X and effective sample size N as the number of rows in X Y 3 Next sample values in the X and Y matrices are normalized so that for each component of the input and output its mean equals 0 and standard deviation equals 1 Ti Ti Yi Yi ti IT Vi 3 1 o x o vi e This is the last sample preprocessing step if not using the Mutual Information tech nique This means that for RidgeFS and SMBFAST techniques the scores are estimated using the normalized reduced matrix rather than the original matrix X Y Mutual Information technique includes one more preprocessing step below 4 The Mutual Information technique is known to possibly show some performance degra dation when feature values are distributed over a uniform grid which is the case after the normalization Due to this in case of using the Mutual Information tech nique whether Kraskov or histogram estimate a small scale uniform noise in range 10 19 10 is applied to all input and output components If rank transform is on see option RankTransform the noise is applied after the transform Thanks to its small scale it does not have any significant effect on the final results while the robustness of the Mutual Inf
9. on this sample It gives satisfactory results on 500 points and good on 1000 points 28 CHAPTER 7 USAGE EXAMPLES DATADVANCE AN EADS COMPANY Table 7 5 Example 1 FAST scores 29 Sample size T Lo T3 vA 5 True 0 0181 0 0727 0 1636 0 2909 0 4545 500 0 0339 0 0963 0 2589 0 2744 0 3362 750 0 0442 0 0824 0 1638 0 2370 0 4723 1000 0 0273 0 0808 0 1681 0 2697 0 4538 DYN DV NN 0 CHAPTER 7 USAGE EXAMPLES AN EADS COMPANY 7 1 2 Example 2 usage of confidence intervals to determine re dundant variables In this example we will demonstrate how knowing confidence intervals can tell us whether the function depends on the feature or not For simplicity let us consider the function f 21 22 3 Z4 5 T 2125 0 0123 x 1 1 i 1 2 3 7 2 1 2 3 Here the function depends very weakly on z3 We generate 200 points random sample for this function and apply GTIVE in this case Mutual Information kraskov algorithm will be used Results for scores and the standard deviation of scores the square root of estimated variance of scores are provided in the table 7 6 Sample size Ly To 3 Scores 0 7494 0 2506 0 0 stdScores 0 1019 0 0745 0 0516 Table 7 6 Example 2 GT IVE scores and the standard deviation of scores Using confidence intervals one may additionally check whether we can trust obtained score v
10. the s score is NaN e In the blackbox based mode if the generation region see 2 4 2 is defined in such a way that the lower and upper bounds of some feature are equal this feature is interpreted as a constant input so its resulting score will be NaN similarly to the sample based mode with a constant column Note that GTIVE can t handle features collinearity For instance if the values of two features are always equal they are assigned equal scores while in reality it is possible that the output is totally insensitive to the first feature and changes its value only due to the change of the second feature This is one of the examples of a degenerate data sample and such features have to be filtered out before passing data to GTIVE 3 3 2 Standard deviation Standard deviation matrix D is structurally similar to the score matrix each element oj is the standard deviation of the 5 score In general 0 is a non negative real number except the case than s score is NaN In this case oj is also set to NaN Note that standard deviation is calculated only when VarianceEstimateRequired is on else the D matrix is empty 15 DYN ND VANN 0 AN EADS COMPANY Chapter 4 User configurable options GTIVE combines a number of scores estimation techniques of different types By default the tool selects the optimal technique compatible with the user specified options and in agreement with the best practice experience Alterna
11. 0 0181 0 0727 0 1636 0 2909 0 4545 30 0 0152 0 0754 0 1782 0 2811 0 4213 100 0 0193 0 0721 0 1691 0 2952 0 4415 Table 7 2 Example 1 Elementary Effects scores Results for Mutual Information Kraskov estimate are presented in the Table 7 3 Kraskov estimate gives satisfactory results on 30 points and quite close to True on 500 points Sample size T 2 T3 TA T5 True 0 0181 0 0727 0 1636 0 2909 0 4545 30 0 1058 0 1051 0 0963 0 26478 0 4279 100 0 0867 0 0785 0 1220 0 2562 0 4563 500 0 0366 0 0774 0 1375 0 2772 0 4711 Table 7 3 Example 1 Mutual Information Kraskov estimate scores Results for Mutual Information histogram estimate are presented in the Table 7 4 As expected Histogram based estimation of Mutual Information is inferior to Kraskov estimate on small samples but still manages to do close to True estimation Sample size Ly T x3 4 Ls True 0 0181 0 0727 0 1636 0 2909 0 4545 30 0 0622 0 0 0656 0 2988 0 5733 100 0 0 0856 0 1486 0 2513 0 5142 500 0 0059 0 0315 0 1287 0 2914 0 5422 750 0 0084 0 0585 0 1501 0 2958 0 4868 1000 0 0 0513 0 1725 0 3020 0 4740 2000 0 0039 0 0609 0 1791 0 2967 0 4591 Table 7 4 Example 1 Mutual Information histogram estimate scores Results for FAST are presented in the Table 7 5 FAST needs at least 65 x 6 390 points to work
12. 36 DYN ND VANN 0 CHAPTER 7 USAGE EXAMPLES AN EADS COMPANY 7 2 3 Fuel System Analysis problem e Problem description The objective of the Research into Fuel Systems project is to deliver application that can predict pressures and mass flows for gravity feed aircraft fuel systems 6 The desktop application comprises a two phase flow air and fuel analysis engine that is derived from experimental observations One of the task the MACROS models are used for in this project is to approximate pressure loss coefficient and volume flow quality of the fuel flow on the diaphragm section of the pipe using experimental data Experimental data is a 244 points sample with 6 features describing fuel flow flow velocity V pressure after the diaphragm P temperature T densities of fuel Pfuet and air Pair ratio of diaphragm diameters r and two outputs pressure loss coefficient C and volume flow quality Q We will use GTIVE to determine which features should be measured with the most accuracy This is very important for experimental design if the feature is unimportant then we shouldn t do additional expensive experiments in order to explore the depen dence of the outputs C and Q on this feature and we can measure this feature with less precision in the experiments e Solution workflow 1 We have a sample of experimental data so sample based technique is going to be used 2 GTIVE scores were computed with the d
13. DATA OZ AN EADS COMPANY MACROS GTIVE Generic Tool for Important Variable Extraction DATADVANCE AN EADS COMPANY 2007 2013 DATADVANCE lle Contact information Phone 7 495 781 60 88 Web www datadvance net Email support datadvance net Technical support questions bug reports info datadvance net Everything else Mail DATADVANCE llc Pokrovsky blvd 3 building 1B 4 floor 109028 Moscow Russia User manual prepared by Pavel Erofeev Pavel Prikhodko Evgeny Burnaev BLAND VANN 0 AN EADS COMPANY Contents List of figures iv List of tables v 1 Introduction 1 LI WhatisGTIVE 1 1 2 Documentation struct re oa ses sises ceire 1 1 2 Overview 3 2 1 Problem statement ooa a a a 4 22 Quality metrics se cec semat eae ke SET a Ea ROGER e GEES 5 6 6 6 2 3 Input Definition Domain Importance oo soo aa a 2 4 State of the art methods ooa a a a 2 4 1 Sample based techniques 2 42 Black box based techniques 9 2 5 Scores variance estimation o oo ooo a a 11 2 6 Remark on other sensitivity analysis methods 11 2 7 Remark on the selection of techniques for GTIVE 11 3 Internal workflow 13 3 1 General workflow sa saga na bee a 13 92 KM Meke emire OE BOBS SSS Eee E R a 13 3 3 F le ya
14. In this approach pdf of x Y and pdf of x Y are estimated using histograms For example pdf of x is estimated as DK Ie z h 2 s h 2 Kh i where h is a bin size I is an indicator function In the GTIVE implementation the cross validation approach is used to estimate optimal histogram bin size h see 5 If the sample size is at least 20000 points then accelerated optimization procedure for the bin size selection is used plz 2 8 Pros Works fast Can handle small as well as large data sets Sample of few dozens points is suffi cient to catch the most important features As the sample size increase resolution grows Robust to noise and outliers Cons Cant handle feature interdependencies e SMBFAST Surrogate Model Based FAST Surrogate Model Based FAST is a complex approach combining the surrogate model ing paradigm and the idea of black box analysis with the extended FAST method see 2 4 2 Currently all GTApprox techniques except the Mixture of Approximators and Geostatistical Gaussian Processes are available in SMBFAST for training the inter nal surrogate model and same features and restrictions apply see the GT Approx manual 2 for details Due to the model training overhead SMBFAST may be time consuming but it is the most accurate of all currently implemented sample based techniques Pros The most accurate of all currently implemented sample based techniques I
15. VarianceEstimateRequired option is on the result also includes score stan dard deviation std calculation is off by default For vector functions functions with multidimensional output feature scores and scores standard deviation are estimated for each component independently see Section 3 3 for the results description For individual technique descriptions see Chapter 4 and Section 2 4 3 2 Preprocessing As we work with initial training dataset some reasonable preprocessing must be applied to it in order to remove possible degeneracies in the data Let X Y be the N x p q matrix of the training data where the rows are p q dimensional training points and the columns are individual scalar components of the input or output The matrix X Y consists of the sub matrices X and Y We perform the following operations with the matrix X Y 1 Remove all exact duplicates search for rows in X Y containing the same data and if two or more matches are found delete every row except one since repeated data 13 BLAND VANN 0 CHAPTER 3 INTERNAL WORKFLOW AN EADS COMPANY points do not add any information warning is sent to log if there were any rows removed 2 Remove all constant columns in sub matrices X and Y constant column means that all the training vectors have the same value of one of the input components In particular for X this means that the training DoE is degenerate and covers only a certain section
16. alue lying inside the score 3 V variance score 3 V variance range So if zero is outside of this range one may decide that score is significantly larger than zero It means that corresponding feature has significant influence on the function value and this feature can be treated as important Actually estimation of true confidence intervals for scores is quite a complicated problem However we consider that our approximation for confidence intervals is sufficiently accurate to help in selection of the important features 2 6 Remark on other sensitivity analysis methods In this section we will discuss GTIVE methods with respect to well known Pearson s and Spearman s correlation coefficients Let us consider the limitations of these correlation coefficients e Pearson s correlation coefficient is suitable only for using with linear functional depen dencies There is an analog of such a technique in GTIVE namely RidgeFS e Spearman s correlation coefficient is suitable only for monotonic functions In GTIVE we do not make such assumptions for nonlinear techniques i e for all except RidgeFS To clarify these points we will give an example Let us consider the sensitivity analysis problem for a function f 2 27 x y 1 1 In this case nonlinear GTIVE techniques are supposed to identify correctly the presence of dependency and the influence of each variable on the output The results are summarized in the table 2
17. alues Score value for third feature is zero so it s contribution was not detected on this sample size To check if scores for the first and the second features are significantly larger than zero one should check if for i th feature zero belongs to the interval Score 3 stdScore Score 3 stdScore For the first feature Score 3 stdScore 0 4437 gt 0 For the second feature Scores 3 stdScoreg 0 0272 gt 0 which means that both scores with very high probability are significantly larger than zero And obviously this value is negative for the third feature 30 DATADVANCE CHAPTER 7 USAGE EXAMPLES AN EADS COMPANY 7 1 3 Example 3 difference between main and total scores in FAST In this example we will consider FAST performance for the function f v1 2 3 20129 23 1 1 i 1 2 3 7 3 that on the one hand is still simple enough to form some expectations of what true scores should be but on the other hand it already has some feature interactions So in this example one may expect to see x having the largest score x2 be on the second place and x3 be the least important feature We will use this example to demonstrate the difference between main and total FAST scores Main scores take into account only isolated variable contribution to the variance of output meaning that main scores would ignore influence of the x x term Total scores on the other side shou
18. ased techniques are implemented e RidgeFS In case the sample is small and so there is no benefit in using complex approaches feature scores may be estimated with linear model DYN VANN 0 CHAPTER 2 OVERVIEW AN EADS COMPANY It is assumed that Y Xb se b bj bp are some coefficients and e cti 1 K is zero mean white noise Coefficients b are estimated as b XTX AI XTY where I RP P and A is tuned using LOO CV approach see 5 Then feature score for i th variable is estimated as p b Jvar zi Jarfy 017 im 2 3 where var is a variance of the i th feature estimated using sample Pros Works fast Can handle very large data sets Best possible choice if the true model is linear Cons Not suitable for strongly non linear models e Mutual Information A group of techniques that estimate feature score by computing Mutual Information of considered feature and the output I x Y fe Ti Y log da dY 2 4 HDS RAM TESTIEG The idea is to measure how far the joint distribution p x Y of the feature and the output is from the case of two independent random values where p ai Y p x p Y The greater the difference the more relevant feature is Feature score for i th variable is estimated as me el m 2 5 In the GTIVE we adopted two techniques to estimate Mutual Information kraskov and histogram estimates Kraskov estimate gives more accurate results bu
19. ation steps Quick start guide C and Python API reference The present document has the following structure e Chapter 2 is an introduction to the tool s functionality It contains an overview of relevant sensitivity analysis concepts and explains the way the tool is applied and what results it produces e Chapter 3 describes the internal workflow of the tool DATADVANCE CHAPTER 1 INTRODUCTION AN EADS COMPANY Chapter 4 describes specific sensitivity analysis techniques implemented in the tool Chapter 5 describes limitations on the sample size for different techniques Chapter 6 describes how the sensitivity analysis technique is selected automatically in a particular problem Chapter 7 gives some examples of GTIVE tool use for some model and real world problems BLAND VAN 0 AN EADS COMPANY Chapter 2 Overview The main goal of GT IVE is to estimate feature scores for the user provided dependency which can be represented as data sample gt or interface to some black box 3 So it solves the problem of global sensitivity analysis As an illustration we give the following simple example Consider the Newton s law of universal gravitation Say we know that every point mass attract every other point mass but don t know what features affect that And say that for some reason we think that following features may affect the force of attraction e mj Mo the masses of the bodies e r distance between b
20. celerator except that 0 is also a valid value meaning that the setting will be automatically selected by the internal approximator e NumberOfCVFold Values integer in range 2 231 2 or 0 auto Default 0 auto select Short description The number of cross validation subsamples to estimate the vari ance of scores 18 DATADVANCE CHAPTER 4 USER CONFIGURABLE OPTIONS AN EADS COMPANY Description In order to estimate the variance of scores the principle of cross valida tion is used Cross validation involves dividing the input sample into a number of subsamples cross validation subsets This option sets the number of subsamples to divide in e SensitivityIndexesType Values enumeration total main Default total Short description Select the type of score index to be computed Description This option is a switch selecting the type of index computed by the FAST procedure used internally in SMBFAST Main index estimate is usually more reliable but this index takes into account only the influence of the considered feature on the output ignoring the influence of cross feature interactions Total index estimates total influence of the variable on the output taking into account all possible interactions between the considered feature and other input features but its estimate is generally less reliable e SurrogateModelType Values enumeration LR SPLT HDA GP HDAGP SGP GeoGP TA iTA RSM or Auto Defaul
21. data sample only Options e NumberOfNeighbors Values integer in range 1 0 8 effective sample size 1 Default 0 auto Short description number of nearest neighbors used to estimate mutual information Description Option specifies number of nearest neighbors used in estimation of mu tual information if kraskov technique is selected manually or automatically Increasing this value gives smaller variance of score estimation at the cost of higher systematic errors and vice versa Best practice recommend to set it as a small integer value of around 5 in most cases e RankTransform Values on off Default on Short description Apply rank transform copula transform before computing mu tual information Description If this option is on True rank transform is applied to the input sample before computing mutual information In most cases it allows for a more accurate mutual information estimate 4 3 Mutual Information Histogram based estimate Short name Hist General description Mutual information estimate of feature scores based on the his togram construction Also see Section 2 4 1 Strengths and weaknesses Too crude for small samples but have very low memory requirements so can be applied in the case of very large data sets If the sample size is at least 20000 then accelerated optimization of histogram parameters is used Tends to underscore features in case of heavy cross feature interac
22. e Exit Blockage 0 963 0 956 0 949 0 942 0 935 0 928 0 921 0 914 0 907 0 9 Stage bleed 0 0 0 0 1 3 0 2 3 0 0 0 Rotor Aspect Ratio 2 354 2 517 2 33 2 145 2 061 2 028 1 62 1 417 1 338 1 361 Stator Aspect Ratio 3 024 2 98 2 53 2 21 2 005 1 638 1 355 1 16 1 142 1 106 Rotor Axial Velocity Ratio 0 863 0 876 0 909 0 917 0 932 0 947 0 971 0 967 0 98 0 99 Rotor Row Space Coef 0 296 0 4 0 41 0 476 0 39 0 482 0 515 0 58 0 64 0 72 Stator Row Space Coef 0 3 0 336 0 438 0 441 0 892 0 455 0 886 0 512 0 583 0 549 Stage Tip radius m 0 3507 0 3358 0 3283 0 3212 0 3151 0 3084 0 3042 0 2995 0 297 o 0 2946 Table 7 9 Stage data for 10 stage design stage e3c des Mass Flow Rate kg s 54 4 Rotor Angular Velocity rpm 12299 5 Inlet Total Pressure Pal 101325 Inlet Total Temperature K 288 15 Mach 3 Last Stage 0 272 Clearance Ratio 0 0015 Table 7 10 Initial data for 10 stage design init e3c des e Solution workflow We perform the following steps to make the analysis 32 DYN DVN Ol CHAPTER 7 USAGE EXAMPLES AN EADS COMPANY Soldity 0 6776 Aspect ratio 5 133 Phi Loss Coef 0 039 Inlet Mach 0 47 Lambda 0 97 IGV Row Space Coef 0 4 Table 7 11 IGV data for 10 stage design igv e3c des We generate data sample of 109 points One may use available code as a black box as well but we didn t do it because code fails to compute outputs in many points On a given sample feature scores are estimated using GTIVE with default set t
23. efault settings Mutual information Kraskov estimate was used in this case see Section 4 2 3 To validate results we ve calculated the Approximation error ratio measure 2 2 for both outputs Table shows that GTIVE scores are in good agreement with feature scores see Table 7 16 Q V P T Pair Pfuel r GT IVE Score 0 7204 0 925 0 2697 0 0688 0 1731 0 6628 Approximation error if fixing feature full model error ial 9 ne Me gt ee C V P T Pair Pfuel r GT IVE Score 0 1888 0 1166 0 0843 0 0773 0 0944 0 4383 Approximation error if fixing feature full model error a ne De JE e ar Table 7 16 Fuel System Analysis Features scores and Approximation error ratio e Results It can be seen that values of scores are in good correspondence with errors of approximation 37 DYN DVN Ol AN EADS COMPANY Bibliography 10 11 12 13 14 15 E Burnaev Construction of the metamodels in support of stiffened panel optimization In Proceedings of the conference MMR 2009 Mathematical Methods in Reliability 2009 Datadvance Generic Tool for Approximation User manual DATADVANCE llc MACROS Generic Tool for Important Variable Extraction 2011 S Grihon Application of response surface methodology to stiffened panel optimiza tion In Proceedings of 47th conference on AIAA ASME ASCE AHS ASC Structures Structural Dynamics and Materials Conference 2006 T Hastie R Tibshirani and J
24. gh even in the case of strong variables inter dependencies Note that method actually allows some randomization so one can get different estimates by varying global Seed parameter 20 DATADVANCE CHAPTER 4 USER CONFIGURABLE OPTIONS AN EADS COMPANY Variance estimation Yes Restrictions Can be applied to the black box only Options Deterministic Values boolean Default on Short description require IVE process to be deterministic Description If this switch is turned on then all random processes in all algorithms are started with some fixed seed ensuring result to be the same on every run In the current version the switch affects only black box based techniques FAST and Elementaty Effects Seed Values integer 1 2147483647 Default 100 Short description change fixed seed when Deterministic is on Description Enables user to use different fixed seeds for IVE process In the current version the switch affects only black box based techniques FAST and Elementaty Effects SensitivityIndexesType Values enum total main Default total Short description selects type of score index to be computed Description Switch selects if the FAST procedure should compute main or total score index Main index takes into account only isolated influence of the con sidered feature on the output ignoring the influence of cross features interactions total index estimates total infl
25. he error measure can be defined as _ V lt F X fam Zi X gt V lt F X fsm X gt where lt gt is the sample mean Higher approximation error ratio means that i th feature is more important Err t 2 2 DATADVANCE CHAPTER 2 OVERVIEW AN EADS COMPANY 2 3 Input Definition Domain Importance It s important to note that the scores returned by GTIVE depend on the variation intervals of the factors If a factor is restricted to a very narrow interval then its score might be low even if factor is important Moreover the scores returned by GTIVE are invariant under changes of units of measurement for individual factors as long as changes are linear In such cases the effects of rescaled intervals are compensated by the corresponding changes in the response function For example consider the case when we have a function f 21 2 1 2 with z 1 1 and z2 1 1 It s obvious to expect x and x to have equal scores in these conditions Now let us expand x to region 2 2 while keeping f 21 72 the same In this case though in each point local importance of x and z remains similar on the global scale x2 would provide 4 times more variation to the output thus rising it s feature score It s equivalent to the case when we leave x at 1 1 and change function to f a 2 2 x 2 2 On the contrary consider the case when we change the measurement units of
26. ime FAST blackbox Table 5 1 Technique summary Contrary to the maximum size there is a certain minimum for the size of the training set or for the available number of blackbox calls which depends on the technique used As explained in Section 3 2 this condition refers to the effective values i e the ones obtained after preprocessing An error with the corresponding error code will be returned if this condition is violated The requirements on minimum sample size budget are summarized in Table 5 2 For most techniques there are two different limits depending on whether the calculation of scores standard deviation is required by user or not see option VarianceEst imateRequired Table 5 2 denotes the following 22 CHAPTER 5 LIMITATIONS DATADVANCE AN EADS COMPANY e p the effective input dimension after the sample preprocessing s the GTIVE SMBFAST NumberOfCVFold option value NN the GTIVE MutualInformation NumberOfNeighbors option value Cor responding limit is in effect only if the option is set by user NR the GTIVE FAST NumberOfSearchCurves option value limit is in effect only if the option is set by user x is the value of x rounded up to the next integer 65p NR NR gt 3 Technique Minimum size bugdet std calculation on std calculation off RidgeFS p 2 p 1 SMBFAST 22 s 2p 3 Mutual Information Kraskov 20 or MH 20 or NN 1 M
27. ings By default in this case histogram based estimate is used see 4 3 Estimated feature scores are plotted on the picture 7 1 Looking at the picture one may see that there are clearly 12 most influential features So it s natural to perform preliminary optimization of compressor varying only this 12 features instead of all 163 To validate the results of the GTIVE we estimated the Index of Variability 2 1 of different important feature subsets Z adding features one by one starting from the ones with higher GTIVE scores and from the lower scores Results are presented on the Figure 7 2 e Results In the Tables 7 12 7 13 the most important feature is filled with dark green next 11 important ones are filled with light green color Feature scores estimated by GT IVE sorted in descending order 0 T Feauture scores estimated by GT IVE GT IVE feature score log scale 0 20 40 60 80 100 120 140 160 Number of feature 180 Figure 7 1 T AXI Feature scores estimated by GTIVE Note This image was obtained using an older MACROS version Actual results in the current version may differ 33 DYN ND VANN 0 CHAPTER 7 USAGE EXAMPLES AN EADS COMPANY Index of variability for selected subset of features values of features that are not selected are fixed 20 i Start adding features with higher GT IVE scores to subset Start adding features wi
28. ion of mutual information using spatially adaptive meshes 17 18 19 Proceeding MICCAI 07 Proceedings of the 10 th international conference on Medical image computing and computer assisted intervention Part 1 950 958 2007 G Tourassia Application of the mutual information criterion for feature selection in computer aided diagnosis Med Phys 28 2001 M Turner A turbomachinery design tool for teaching design concepts for axial flow fans compressors and turbines Proceedings of GT2006 2006 S Vallaghe A global sensitivity analysis of three and four layer eeg conductivity models Biomedical Engineering IEEE Transactions 56 pages 988 995 2009 39 DATADVANCE AN EADS COMPANY Index design of experiment 4 feature score sensitivity index 4 5 global sensitivity analysis 4 6 GT IVE Generic Tool for Important Vari able Extraction 1 optimization 4 Options 18 Deterministic 20 21 MinCurveNum 20 NumberOfCV Fold 18 NumberOfNeighbors 17 NumberOfSearchCurves 21 RankTransform 17 18 Seed 20 21 SensitivitylndexesType 19 21 SurrogateModelType 19 VarianceEstimateRequired 13 Quality metrics Approximation Error 5 Index of Variablity 5 surrogate model 4 Techniques Black box based 9 Elementary Effects 9 19 Extended FAST Extended Fourier Am plitude Sensitivity Test 9 20 Sample based 6 Linear regression RidgeFS 6 16 Mutual Information 7 a SMBFAST
29. ld account all feature interactions In the manual dependency analysis comparison of these two indices allows for some investigation of the dependency nature We ve estimated these scores using 500 and 1000 points samples to show the difference in the results Total scores are presented in the Table 7 7 Sample size Ly Lo 13 500 0 4449 0 3985 0 1503 1000 0 4965 0 4125 0 0869 Table 7 7 Example 2 FAST total scores Main scores are presented in the Table 7 8 Sample size T Ta T3 500 0 3353 0 0019 0 6604 1000 0 5016 0 0033 0 4910 Table 7 8 Example 2 FAST main scores Let S71 Srs Sr3 be total indexes of variables and Sm1 Sm2 5Sm3 be main indices One may see that Sy X 0 Sy gt Smp it gives one a hint that x2 feature appears only in interaction with some other Also one may remember that Sr Smi interaction terms i e say 571 Say 512 513 where for the example Si is a term accounting for x and z interaction Notice also that Say X Sm3 Suya O and 571 amp Spo S73 gt Sp 513 amp S12 S93 S13 S23 gt S23 X 0 As a result we can make an educated guess that our function has the following form f 21 2 3 fi a1 folz3 f3 a1 2 fal 13 31 DYN VANN Ol CHAPTER 7 USAGE EXAMPLES AN EADS COMPANY 7 2 Real world data examples In this section we will show application of GTIVE to some real world data
30. mples In this section we will demonstrate performance of various techniques implemented in the GTIVE on some known artificial functions 7 1 1 Example 1 simple function no cross feature interaction In this example we will consider the function f x1 2 3 Ta 5 2 222 323 427 522 x 1 1 i 1 5 7 1 In this case we have no cross feature interactions So we can approximately estimate that true scores should have ratio 1 4 9 16 25 In this example we will refer to these scores as True We ve calculated feature scores with all methods for different sample sizes and presented comparison with our expectations of what true features might be in this problem in the tables below Results for RidgeFS are presented in the Table 7 1 As expected RidgeFS assumes linear dependency so methods fails to estimate correct scores Sample size T 2 x3 T4 Ls True 0 0181 0 0727 0 1636 0 2909 0 4545 30 0 1847 0 1628 0 1113 0 2425 0 2983 100 0 2164 0 2074 0 1687 0 2128 0 1944 500 0 1449 0 1876 0 2312 0 1505 0 2855 Table 7 1 Example 1 RidgeFS scores 27 DATADVANCE CHAPTER 7 USAGE EXAMPLES AN EADS COMPANY Results for Elementary Effects are presented in the Table 7 2 Elementary Effects gives satisfactory close to True results on 30 points sample already and very close results on 100 points Sample size Ly 19 T3 LA 5 True
31. ncorporates approximation capabilities of GT Approx Cons May take a long time as building of GT Approx model inside is required DATADVANCE CHAPTER 2 OVERVIEW AN EADS COMPANY 2 4 2 Black box based techniques These techniques generate new sample points during their work so they require connection to some black box function Y f X In case of black box based method term budget is used instead of sample size Note that in these methods one has to specify the region some hypercube where points are generated In the GTIVE the following black box based techniques are implemented e Elementary Effects is a screening technique able to work with relatively small sam ples The idea of Elementary Effects approach is to generate uniform in terms of space filling properties set of trajectories in the design space On each step of trajectory only one component x of input vector X is changed and the following function is estimated aje anm fot ee eo 2 9 where is a step size Score for i th feature is computed as NE gal SP 2 10 UNE 7 var Y a where pi J5 d X r is a number of steps changing i th feature value on all trajectories X is the input value at these steps A is a range of possible values for i th feature var Y is a sample variance of black box values on generated sample points Actually the method gives normalized estimate of average squared partial derivatives Pros Can provide
32. odies e T environment temperature p atmospheric pressure Li L bodies luminosity Mx m Figure 2 1 The Newton s law of universal gravitation Also we did 30 experiments and measured all the features considered and the correspond ing force of attraction Applying GT IVE to this task give us the following feature ranks see 2 1 In general the tool helps to answer the following questions also known as function or model 2also known as training data or samples 3some device system or object that provides output for a given input DYN VANN Ol CHAPTER 2 OVERVIEW AN EADS COMPANY hee el 0 19 0 20 0 61 0 0 0 0 0 0 0 0 Table 2 1 Illustration Scores for the Newton s law of universal gravitation problem 1 What features have no influence on the dependency and thus can be dropped in the further study 2 If we want to reduce the number of features considered in the problem which features should we drop 3 What features are the most influential so that they should be measured with the highest accuracy or have the highest variability in the Design of Experiments GTIVE calculates sensitivity indices features scores for each input variable feature That are the numbers that show relative importance of each feature in some sense Looking at the scores one can say if one feature is more important than the others and guess to what extent This information may be useful in the following ta
33. ormation technique is notably improved 3 3 Results The resulting output of GTIVE contains a feature score matrix S and if std calculation is on see option VarianceEstimateRequired a score standard deviation matrix D The size of both matrices is g x p the number of rows is equal to the output dimension g the number of columns is equal to the number of features or the input dimension p the original input dimension not the effective input dimension p 3 3 1 Feature scores Each element s of the S matrix is the sensitivity of the i th output component to the j th feature In general s is a positive real number except some special cases e In the sample based mode if the value of the j th feature in the sample is constant the X matrix contains a constant column all scores of this feature j th column in S are set to NaN special not a number value since there is no way to estimate the sensitivity of the output to a constant component 14 DATADVANCE CHAPTER 3 INTERNAL WORKFLOW AN EADS COMPANY e In the sample based mode if the value of the i th response component in the sample is constant the Y matrix contains a constant column the scores of all features vs this output i th row in S are set to 0 0 it is assumed that this output is insensitive to all features since its value is constant e The first of the above rules has priority if the sample contains both a constant feature x and a constant output y
34. problems 7 2 1 T AXI problem e Problem description In this problem we consider The T C_DES Turbomachinery Compressor DESign code meanline axial flow compressor design tool which is the first step of T AXI an axisymmetric method for a complete turbomachinery geometry design 18 Program tcdes e3c des exe is used for calculation of outputs f X for new generated inputs X Program can be downloaded from the link http gtsl ase uc edu T AXI Program uses a 163 dimensional feature vector describing geometry and the working condition as an input The task is to determine subset of the most important features for the Compressor Pressure Ratio With IGV output The dependency is considered only for X V X X xf 1 a x 1 a z i 1 163 where a 0 1 X xf g3 is given in Tables 7 9 7 11 Stage Parameter 1 2 3 4 5 6 7 8 9 10 Stage rotor inlet angle deg 10 3 13 5 15 8 18 19 2 19 3 16 3 15 13 6 13 4 Stage rotor inlet Mach no 0 59 0 51 0 475 0 46 0 443 0 418 0 402 0 383 0 35 0 313 Total Temperature Rise K 52 696 52 301 51 117 49 736 49 144 43 617 45 69 47 269 48 255 47 565 Rotor loss coef 0 053 0 0684 0 0684 0 0689 0 069 0 069 0 069 0 069 0 069 0 07 Stator loss coef 0 07 0 065 0 065 0 06 0 06 0 065 0 065 0 065 0 065 0 1 Rotor Solidity 1 666 1 486 1 447 1 38 1 274 1 257 1 31 1 317 1 326 1 391 Stator Solidity 1 353 1 277 1 308 1 281 1 374 1 474 1 379 1 276 1 346 1 453 Stag
35. recommended in 13 e FAST is a common way to calculate so called global sensitivity indexes The effi cient calculation of such indexes with FAST is described in 13 and 9 Examples of usage of this approach are given in 15 and 19 12 DYN DVN 0 AN EADS COMPANY Chapter 3 Internal workflow 3 1 General workflow As described in Section 2 4 GTIVE includes two types of techniques blackbox and sample based Main difference regarding the tool s internal workflow is that there is no prepro cessing step in the blackbox based mode since in this mode GTIVE generates the sample itself and ensures it has a correct structure and does not contain any degenerate data Con versely in sample based mode the sample analysis is essential because in general there are no guarantees for the sample quality Thus GTIVE internal workflow generally consists of the following steps 1 Preprocessing Only in sample based mode In this step redundant data is removed from the training set and the sample is normalized see Section 3 2 2 Analyzing training data and options selecting technique In this step training sample properties and options specified by user are analyzed for compatibility and the most appropriate estimation technique is selected see Chapter 6 3 Estimating feature scores and scores standard deviation In this step feature scores are estimated using the technique selected in the previous step If the
36. reliable estimates even for very small budgets Minimal number of black box function calls equals few times number of features which is sufficient to get estimation for not very complex cases Cons Generates trajectories randomly Not robust to outliers e Extended FAST Fourier Amplitude Sensitivity Testing is a technique suited for the case when cheap black box is available like surrogate model see 2 4 1 It requires quite many samples to estimate score The idea here is to measure what portion of output variance is described by the variance of the feature To do so for each feature main indices are estimated as Ve Ewa ri e a 2 11 number of function calls allowed for method DATADVANCE CHAPTER 2 OVERVIEW AN EADS COMPANY where V is a variance with respect to z E x is a conditional mean with respect to all features except x Instead of computing multivariate Monte Carlo esti mates method uses space filling one dimensional curves of the form 1 1 e z arcsin sin v s 2 12 T to generate sample points Here each feature have some frequency v assigned from some incommensurate set v s is the coordinate on one dimensional curve and is a some random constant phase shift Using Fourier decomposition in case of 2 12 we may say that co f X J Aj cos js Bjsin js j A gt slsvoostis as Bj D f s sin js ds These integrals can be estima
37. result is the estimated feature scores The selection is performed in agreement with properties of individual technique as described in Chapter 4 In particular for the sample input 24 DAVADAN CHAPTER 6 SELECTION OF TECHNIQUE AN EADS COMPANY Sample input BB Input YES Feature Scores YES YES gin Figure 6 1 The GTIVE internal decision tree for the choice of default estimation method 25 DATADVANCE CHAPTER 6 SELECTION OF TECHNIQUE AN EADS COMPANY e If p lt 10 K lt 300 and 2p 2 lt K lt 2 2p 3 RidgeFS is selected e If p lt 10 K lt 300 but K gt 2 2p 3 SMBFAST is selected e In other cases Mutual Information is selected which uses the Histogram technique if K gt 500 and accelerated histogram estimate if K gt 20000 and Kraskov if 20 lt K lt 500 For the black box input e If K gt 4 725 1 then the FAST technique is chosen e If2 p 1 lt K lt 4 72p 1 then the Elementary Effects is used e If p 1 lt K lt 2 p 1 the tool will start only if score variance estimation is not required by user see option VarianceEstimateRequired Otherwise if variance estimation is required or K lt p 1 the tool will not start 26 DYN DVN 0 AN EADS COMPANY Chapter 7 Usage Examples In this section we will apply GTIVE to some artificial model functions and some real world data sets to demonstrate method properties 7 1 Artificial Exa
38. ses in all algorithms are started with some fixed seed ensuring result to be the same on every run In the current version the switch affects only black box based techniques FAST and Elementaty Effects e Seed Values integer 1 2147483647 Default 100 Short description change fixed seed when Deterministic is on Description Enables user to use different fixed seeds for IVE process In the current version the switch affects only black box based techniques FAST and Elementaty Effects e MinCurveNum Values integer 1 2147483647 Default 200 Short description number of space filling curves tested to compute elementary ef fects Also see Section 2 4 2 Description Option specifies number of curves to be used in estimation of elementary effects The more curves is used the better parameter space is explored resulting in more accurate scores estimation however it takes additional time 4 6 Extended FAST Fourier Amplitude Sensitivity Test ing Short name FAST General description Variance based estimation of feature scores Methods can estimate cross variable interactions as well as isolated main variable indices which can be useful to some additional manual dependency analysis 12 Also see Section 2 4 2 Strengths and weaknesses Needs large enough computational budget number of func tion calls at least 65 calls per feature to get stable estimate however is very precise if the budget is enou
39. sks e In the Surrogate Model SM construction it may be beneficial to remove the least important features because less features mean more dense sample and denser sample may provide more accurate approximation Also many SM construction techniques may work better in smaller dimensions in terms of time memory requirements e In the Design of Experiment knowing what features influence dependency the most one can plan the sample generation in a way that most important features have the highest variability Also if data is obtained as some physical measurements knowing feature scores may tell what input variables should be measured with the highest accuracy e In the Optimization when the number of allowed function calls budget is limited knowing what features are less important allows for not changing them in the opti mization process Reducing number of variables by not considering features that have little effect on the dependency one can do more optimization iterations with the same budget possibly acquiring better solution Examples of GTIVE applications to the mentioned above tasks are presented in the Chapter 7 In this chapter the sensitivity analysis problem statement is given and short review of the state of the art methods used in the tool is provided 2 1 Problem statement The problem of the global sensitivity analysis is to estimate how variations in the output of the model can be attributed to the variations in the model
40. so one may notice that RF STR is independent from feature Fi 4 To validate the results of the GTIVE we used approximation error ratio measure 2 2 of RF STR Results of this experiment are presented in Table 7 15 and show that error of approximation are in agreement with the values of feature scores estimated by GTIVE F F2 F G GT IVE mean std mean std mean std mean std 50 pnts 0 0 0 0477 0 0207 0 2713 0 0704 0 0732 0 0377 300 pnts 0 0 0 0602 0 0107 0 2889 0 0216 0 0703 0 008 1000 pnts 0 0 0 0624 0 0041 0 286 0 0129 0 0715 0 0049 Go Gs G4 Gs GT IVE mean std mean std mean std mean std 50 pnts 0 062 0 0197 0 1056 0 0354 0 3323 0 0836 0 1079 0 0233 300 pnts 0 0673 0 0074 0 1001 0 0123 0 3135 0 0296 0 0998 0 0157 1000 pnts 0 0674 0 0037 0 1043 0 0062 0 3089 0 0136 0 0996 0 0058 Table 7 14 Stringer stress analysis Feature scores estimated by GTIVE e Results GTIVE showed that RF STR value is independent of the feature F GTIVE using as few points as possible was able to estimate reliably relative importance of each feature 35 DATADVANCE CHAPTER 7 USAGE EXAMPLES AN EADS COMPANY F F gt F Gi GT IVE Score 0 0 0624 0 286 0 0715 Approx error if fixing feature full model error 0 98 2285 126 04 29 85 G2 G Gy G GT IVE Score 0 0674 0 1043 0 3089 0 0996 Approx error if fixing feature full model error 29 83 45 72 14246 43 63 Table 7 15 Stringer stress analysis Approximation error ratio
41. t Auto Short description Specify the algorithm for the internal approximator used in SMB FAST Description Since SMBFAST builds a surrogate model to be used as a FAST black box it actually uses GT Approx internally and makes certain options of this internal approximator available as GTIVE options This option is essentially the same as GTApprox Technique Default Auto selects a technique according to the GTApprox decision tree with a single difference HDAGP is never selected automatically and where GTApprox would select HDAGP the GP technique is used instead 4 5 Elementary Effects Short name EE General description A screening technique estimating feature scores as an average of the function partial derivatives 13 Also see Section 2 4 2 Strengths and weaknesses Can work with very small budgets and still give reliable estimates in most cases however may take time if the budget is big due to complex prob lem of selecting appropriate set of trajectories Note that method actually allows some randomization so one can get different estimates by varying global Seed parameter Variance estimation Yes 19 DATADVANCE CHAPTER 4 USER CONFIGURABLE OPTIONS AN EADS COMPANY Restrictions Can be applied to the black box only Options e Deterministic Values boolean Default on Short description require IVE process to be deterministic Description If this switch is turned on then all random proces
42. t becomes computationally expensive and so can t be used for large data samples Histogram based estimate may be crude on small samples but is very cheap in terms of memory and computation time so it can be applied to a very large data sets In more details Kraskov estimate is an estimation of Mutual Information technique based on nearest neighbor approach The technique provides good accuracy for small and moderate sample sizes but becomes very computationally expensive in case of large samples Define a metric in space Z X Y as p Z Z max p X X py Y Y where p X X is the Euclidean norm in the X space and p Y Y is the Euclidean norm in the Y space Let k be the algorithm parameter setting number of nearest neighbors in the Z space then let elj p Z k th neighbor of Z 2 6 T DATADVANCE CHAPTER 2 OVERVIEW AN EADS COMPANY We set nj and n as number of points in the X and Y spaces correspondingly whos distance to X and Y is smaller than e In 8 it s shown that Iy 25 9 Wk Vs 1 Yny 1 Y K 2 7 where is the sample mean k is the number of nearest neighbors algorithm parameter W 2 is Euler digamma function Histogram based estimate is an estimate of Mutual Information technique using histogram based pdf estimation Method may be less accurate than previous one in case of small and moderate samples but can handle very large data sets
43. ted using points generated on the curve 2 12 In this case e g conditional variance can be estimated as K VP 2y 45 By ju is an integer 2 13 j l where K is some predefined number Another appealing property of this approach is it s ability to accurately estimate total indices In this case all cross variable interactions that include i th feature are taken into account in the corresponding scores i e the score is estimated as follows Vag Ez Y 1 vee ii Ti Zp m VY 2 14 To do this estimation unique frequency v is given to x and the same frequency v is given to all other features then the same procedure as above is performed The score for i th feature is Pros Can give main effect as well as total effect estimations Needs less samples than for most of other variance based approaches about 72 points per feature is recommended Cons Still requires relatively large samples What technique to choose in each case is decided by the initial problem conditions we have sample or black box and best practice For details see Chapter 6 10 DATADVANCE CHAPTER 2 OVERVIEW AN EADS COMPANY 2 5 Scores variance estimation It s possible to compute score estimation variances to check how reliable obtained scores values are When one obtains score and estimation of variance one may expect that there is high prob ability usually estimated at 99 99966 of the true score v
44. th lower GT IVE scores to subset Index of variability gt T 0 t o 20 40 60 80 100 120 140 160 180 Number of features selected Figure 7 2 T AXI Index of Variance Note This image was obtained using an older MACROS version Actual results in the current version may differ Stage Parameter 1 2 3 4 5 6 7 8 9 10 Stage rotor inlet angle deg 10 3 13 5 15 8 18 19 2 19 3 16 3 15 13 6 13 4 Stage rotor inlet Mach no 0 59 0 51 0 475 0 46 0 443 0 418 0 402 0 383 0 35 0 313 Rotor loss coef 0 053 0 0684 0 0684 0 0689 0 069 0 069 0 069 0 069 0 069 0 07 Stator loss coef 0 07 0 065 0 065 0 06 0 06 0 065 0 065 0 065 0 065 0 1 Rotor Solidity 1 666 1 486 1 447 1 38 1 274 1 257 1 31 1 317 1 326 1 391 Stator Solidity 1 353 1 277 1 308 1 281 1 374 1 474 1 379 1 276 1 346 1 453 Stage Exit Blockage 0 963 0 956 0 949 0 942 0 935 0 928 0 921 0 914 0 907 0 9 Stage bleed 0 0 0 0 1 3 0 2 3 0 0 0 Rotor Aspect Ratio 2 354 2 517 2 33 2 145 2 061 2 028 1 62 1 417 1 338 1 361 Stator Aspect Ratio 3 024 2 98 2 53 2 21 2 005 1 638 1 355 1 16 1 142 1 106 Rotor Axial Velocity Ratio 0 863 0 876 0 909 0 917 0 932 0 947 0 971 0 967 0 98 0 99 Rotor Row Space Coef 0 296 0 4 0 41 0 476 0 39 0 482 0 515 0 58 0 64 0 72 Stator Row Space Coef 0 3 0 336 0 438 0 441 0 892 0 455 0 886 0 512 0 583 0 549 Stage Tip radius m 0 3507 0 3358 0 3283 0 3212 0 3151 0 3084 0 3042 0 2995 0 297 0 2946 Table 7 12 T AXI Fea
45. tions Variance estimation Yes Restrictions Can be applied to data sample only 17 DAVADAN z CHAPTER 4 USER CONFIGURABLE OPTIONS AN EADS COMPANY Options e RankTransform Values on off Default on Short description Apply rank transform copula transform before computing mu tual information Description If this option is on True rank transform is applied to the input sample before computing mutual information In most cases it allows for a more accurate mutual information estimate 4 4 SMBFAST Surrogate Model Based FAST Short name SMBFAST General description Surrogate Model Based FAST combines the surrogate modelling and usage of extended FAST method Also see Section 2 4 1 Strengths and weaknesses SMBFAST may be time consuming but it is the most ac curate of all currently implemented sample based techniques Variance estimation Yes Restrictions Can be applied to data sample only Options e Accelerator Values integer in range 1 5 or 0 auto Default 0 automatically set by the approximator Short description Five position switch to control trade off between speed and ac curacy for the internal approximator used in SMBFAST Description Since SMBFAST builds a surrogate model to be used as a FAST black box it actually uses GT Approx internally and makes certain options of this internal approximator available as GTIVE options This option is essentially the same as GTApprox Ac
46. tively the user can directly specify the technique through advanced options of the tool This section describes the available techniques and it s options selection of the technique in a particular problem is described in Chapter 6 4 1 RidgeFS Short name LR General description Estimation of feature scores as normalized coefficients of regular ized linear regression Regularization coefficient is estimated by minimization of generalized cross validation criterion 5 Also see Section 2 4 1 Variance estimation Yes Restrictions Can be applied to data sample only Strengths and weaknesses A very robust and fast technique with a wide applicability in terms of the input space dimensions and amount of the training data It is however usually rather crude and the estimation can hardly be significantly improved by adding new training data Options No options 4 2 Mutual Information Kraskov estimate Short name Kraskov General description Mutual information estimate of feature scores based on the nearest neighbors information 8 Also see Section 2 4 1 16 DYN DVN Ol CHAPTER 4 USER CONFIGURABLE OPTIONS AN EADS COMPANY Variance estimation Yes Strengths and weaknesses Is a robust nonlinear estimation technique however can be applied only to small moderate samples due to memory limitations Method tends to underscore features in case of heavy cross feature interactions Restrictions Can be applied to
47. tures that influence Compressor Pressure Ratio the most a Mass Flow Rate kg s 54 4 Inlet Total Pressure Pa 101325 Mach 3 Last Stage 0 272 Clearance Ratio 0 0015 Table 7 13 T AXI Features that influence Compressor Pressure Ratio the most b DYN VANN 0 CHAPTER 7 USAGE EXAMPLES AN EADS COMPANY 7 2 2 Stringer Super Stiffener Stress Analysis problem e Problem description Special tool for Stress Analysis build upon a physical model computes Reserve Factors RFs constraints for a side panel of an airplane defined by its geometry G j 1 5 and applied forces F i 1 2 3 1 4 Our task here is to check whether all inputs equally influence the output RFs In particular the case of stringer RF RF STR is considered e Solution workflow 1 We have a code that can compute RFs for the given point so we may use black box technique 2 We estimate feature scores with default settings and various budget and Seeds see 4 5 for details to check what size of budget for GTIVE gives reliable esti mates and how stable the estimates are Elementary Effects technique was taken by default see Section 4 5 3 Results for different budget sizes are presented in the Table 7 14 For each budget size 10 runs with different seeds were made to estimate standard deviation of results One can see that mean estimates are already quite reliable on 50 points and variance of the results reduces as sample size increase Al
48. uence of the variable on the output taking into account all possible interactions between the considered feature and other input features but it s estimate is generally less reliable NumberOfSearchCurves Values integer 0 2147483647 Default 0 0 means auto selection 4 if the budget is sufficient and less otherwise Short description adds random multistart to FAST curves used for estimation of sensitivity indexes Description Option allows performing multistart when building FAST space filling curves It can potentially increase accuracy at the cost of increasing the bud get requirements NumberOfSearchCurves times Minimal allowable budget is equal to 65 p NumberOfSearchCurves where p is the effective dimension of input vector the number of not constant input factors 21 BLAND VANN 0 AN EADS COMPANY Chapter 5 Limitations The maximum size of the training sample which can be processed by GTIVE is primarily determined by the user s hardware Necessary hardware resources depend significantly on the specific technique see descriptions of individual techniques Accuracy of estimation tends to improve as the sample size increases Technique Input type Performance on huge Other restrictions training sets RidgeFS sample linear dependencies only Kraskov sample limited by available RAM Histogram sample SMBFAST sample potentially long runtime EE blackbox potentially long runt
49. utual Information histogram 3 3 EE 2 p 1 1 FAST 65 3 or 65p or 655 NR Table 5 2 Minimum sample size blackbox budget for GTIVE techniques 23 Corresponding DYN ND VANN 0 AN EADS COMPANY Chapter 6 Selection of technique This section details manual and automatic selection of one of the techniques described in Chapter 4 6 1 Selection of the technique by the user The user may specify the technique by setting the option Technique which may have the following values e Auto best technique will be determined automatically default e RidgeFS e Mutual Information to select specific estimation type additional parameter MutualInformation Algorithm may be specified having possible values kraskov for Kraskov estimation or hist for histogram based approach If none is specified than kraskov estimate is used if there is lt 500 sample points and hist estimate is used oth erwise If hist estimate is used and the sample size is at least 20000 then accelerated optimization of hist parameters is used bi e SMBFAST e ElementaryEffects e FAST 6 2 Default automatic selection The decision tree describing the default selection of the estimation technique is shown in Figure 6 1 The factors influencing the choice are e Input type i e sample or blackbox e Sample size for blackbox budget K and effective input dimension p of the training sample The
50. wing measures could be introduced These are intuitive straightforward ways to check the variable importance however huge amount of data or time is required to evaluate them so these measures are not very suited for practical use and are mostly useful as reference in the benchmarking of different sensitivity analysis methods e Index of variability may be used to compare importance of the features or even feature subsets if we can calculate dependency value in a given point Let features in the vector X be split into two subsets X Z X U X where the subvector Z X contains all important features features with high scores and U X contains all unimportant features features with low scores Let us define by R X Z X Up some vector where all unimportant features are fixed to some average values Then the Index of Variability can be computed as follows wy VU KOO gt max f X min f X where lt gt max min are some test sample mean maximum and minimum The higher index of variability the less important features are chosen in Z and the more important are fixed in U r 100 2 1 e Approximation error ratio Another way to estimate i th feature importance is to build an approximation surrogate model fam Z X where Z X 1 Gi 1 Vi41 Lp i e input formed from X using all features except i th and compare it s accuracy with approximation fsu X built using all features So t
51. x 40 Index Options 41 DYN DVN Ol AN EADS COMPANY List of Figures 2 1 The Newton s law of universal gravitation 3 6 1 The internal decision tree 2 vk a bee me ee ORE FE HOE GR Re sad 25 7 1 T AXI Feature scores estimated by the GTIVE 33 72 T AXI Index of Variance s4 x 04eyvexseyala el ke ee 34 iv DYN DVN Ol AN EADS COMPANY List of Tables 2 1 Illustration Scores for the Newton s law of universal gravitation problem 4 2 2 Pearson s and Spearman s correlation coefficients and GTIVE techniques 12 5 1 Technique SUMMIT f gto ud we me S dk See doe ker Be ie eae Bele ad 22 5 2 Minimum sample size blackbox budget for GTIVE techniques 23 7 1 Example 1 RidgeFS scores 27 7 2 Example 1 EFlementaryFflectsscores 28 7 3 Example 1 Mutual Information Kraskov estimate scores 28 7 4 Example 1 Mutual Information histogram estimate scores 28 7 5 Example 1 FAST scores 52 lt 4 4 1a dar save eek KE Ga 29 7 6 Example 2 GT IVE scores and the standard deviation of scores 30 7 7 Example 2 FAST total scores lt as oase 6448 SEG Sea ka 31 7 8 Example 2 FAST main scores saks SVA TALEN AK SEG sg 31 7 9 Stage data for 10 stage design stage e3c des 32 7 10 Initial data for 10 stage design init e3c des

MACROS GTIVE Generic Tool for Important Variable

Contents

Download Pdf Manuals

Related Search

Related Contents