RPM MANUAL
VERSION: July 26th
AUTHOR: Tsvika Klein tklein@im.wustl.edu

---------------------------------------------------------------------------------
OPERATING SYSTEM
---------------------------------------------------------------------------------
This version of RPM runs only in a Unix/Linux environment. The program was
tested on a Sun Fire 480R machine running Solaris 5.9 and was compiled with gcc.

INSTALLATION
---------------------------------------------------------------------------------
To run the program, follow these instructions:

1. Get the source file rpm.c
2. Compile the file as follows:
       gcc [rpm file].c -o rpm.o -lm
   The -lm option links the math library and should follow the source file.
   The -o option is optional; without it, gcc names the executable a.out
   instead of rpm.o.
3. Construct the input file and the parameter file as described below.
4. Run rpm as follows:
       rpm.o -input [inputfile] -output [outputfile] -param [paramfile]

   You can also rely on the default file names by omitting the tags:
       rpm.o
   is equivalent to
       rpm.o -input data -output data.out -param param

   Example:
       rpm.o -input data -output data.out -param param

PARAMETER FILE FORMAT
---------------------------------------------------------------------------------
Sample parameter file for a 2-loci simulation:

    TRAITS=1
    COV=0
    GENO=10

    1
    1 4
    2 3
    3*5

Sample parameter file for a 3-loci simulation:

    TRAITS=1
    COV=0
    GENO=10

    1
    1 2 3
    2*5

The parameter file is divided into two sections:
1. The required and optional tag values
2. The simulation properties

The two sections are separated by an empty line. You MUST have an empty line
between the two parts; otherwise, your simulation will fail.

SECTION 1:
There are required and optional tags. The order of the tags is not significant.
Each tag has the following format:
    TAGNAME=TAGVALUE

The required fields are:
    TRAITS=[int value]              // the number of traits in the data
    COV=[int value]                 // the number of covariates in the data
    GENO=[int value]                // the number of genotypes in the data

The optional fields are:
    MISS=[int value]                // a number for a missing value
    PERM=[int value]                // number of permutations
    COMB=[int value]                // number of loci to consider together for a test
    ALPHA=[double value]            // alpha value for the GH method
    PSEQ=[float value float value]  // two numbers for the sequential testing
    TTYPE=[int value]               // trait type
    ZERO=[int value]                // determines if Rsq=0 is ignored for permutations
    SEED=[long value]               // seed number
    SINGLE=[int value]              // including a single sample when calculating Rsq

The tags in more detail:

Required tags:
TRAITS  Number of "traits" (quantitative measures) in the data file.
COV     Number of "covariates" (non-genotypic, qualitative covariates such as
        sex, race, environmental exposure).
GENO    Number of genotypes (loci).

Optional tags:
MISS    Integer value used in the data file to indicate a "missing" value.
        Default: -777
PERM    Number of permutations.
        Default: 1000
COMB    Number of loci to consider together for a test. At the moment, it is
        possible to run combinations of two and three loci.
        Default: 2 loci
ALPHA   Alpha value for the GH method.
        Default: 0.01
PSEQ    Two numbers for the sequential testing. Sequential testing checks the
        p-value after 100 and after 1000 permutations against the two values
        provided. The first value is the p-value threshold after 100
        permutations and the second is the threshold after 1000 permutations.
        If the p-value after 100 permutations is greater than the first value,
        the program will not run more permutations. The same applies after
        1000 permutations with the second value. However, the number of
        permutations will never exceed the value of PERM.
        Default: 0.1 0.01
TTYPE   The type of "traits": 0=qualitative, 1=quantitative (note: only
        quantitative traits are supported at this time).
        Default: 1
ZERO    Determines whether tests with an Rsq of 0 are ignored for permutation
        testing: 1 = don't run permutations if Rsq is 0 (note: at this time
        only option 1 is allowed for this parameter).
        Default: 1
SEED    Seed number (integer: 0 generates a seed based on the time, while >0
        specifies the seed).
        Default: 0
SINGLE  When calculating Rsq, you might want to ignore any group with only one
        observation. This parameter lets you do so: 0 removes such groups;
        1 keeps them untouched.
        Default: 0

SECTION 2:
This section specifies the simulation properties.

Line 1: Identifies which traits should be used. If there is more than one, the
traits should be separated with a space delimiter (e.g., 1 2 4 will run the
analyses sequentially on trait 1, then 2, then 4).

Line 2-END: Specifies the covariate/genotype groups that should be examined.
You may specify a different test on each line starting from line 2. There are
two ways to specify the tests:

1. Separate the covariates/genotypes with a space delimiter.
   For a 2-loci simulation: "1 4" runs the combination of covariate/genotype
   1 and 4, while "2 3" runs the combination of covariate/genotype 2 and 3.
   For a 3-loci simulation: "1 2 4" runs the combination of
   covariate/genotype 1, 2 and 4.
2. Use the symbol '*' to specify all the combinations between the two numbers.
   For a 2-loci simulation: "3*5" runs the combinations of covariate/genotype
   3 and 4, 3 and 5, and 4 and 5.
   For a 3-loci simulation: "2*5" runs the combinations of covariate/genotype
   2 3 4, 2 3 5, 2 4 5, and 3 4 5.

Important: covariates and genotypes are considered together for the groupings.
You must put your covariate columns before your genotype columns. When
specifying the covariate/genotype groupings to be analyzed, the numbering
starts at 1 and does not follow the column number in the data.
For example, in a data file with one index column and one trait column, the
covariate/genotype combination 1 4 corresponds to the covariate in column 3
and the locus in column 6 of the data file.

DATA FILE FORMAT
---------------------------------------------------------------------------------
The data file includes the following columns:
1. Index number
2. Values of traits (can have more than one trait)
3. Values for covariates/genotypes (can and will have more than one
   covariate/genotype)

It is easier to illustrate the requirements for the input file using an
example. Assume that you have a sample of 500 subjects with 1 quantitative
measure of interest, 1 covariate, and 4 SNPs thought to be related to the
quantitative measure. The text file for this data, showing only the first
5 subjects, is as follows:

    100010  -133.39  M  1/1  1/1  1/2  1/1
    100020   101.97  M  1/1  1/2  1/1  1/1
    100030   140.78  M  1/2  1/2  1/1  1/1
    100040  -691.19  M  1/1  1/1  1/2  1/1
    100050  -508.87  M  1/1  1/1  1/2  1/2

    column 1    <-- corresponds to the index number
    column 2    <-- corresponds to the trait value
    column 3-7  <-- correspond to the covariates/genotypes

ID: The first column is the ID number that identifies the subject.

Trait: The second column is the quantitative measure. Each additional trait
should be represented as an additional column following the first quantitative
trait column.

Covariates / Genotypes: The columns after the last quantitative trait
correspond to the covariates and genotypes. In the data example above we have
1 covariate coded as M or F and 4 SNPs (genotypes) coded as 1/1, 1/2, and 2/2.
The alleles for the genotypes can be represented with either numbers or
letters.

Important:
1. Covariates should always come before the genotypes.
2. The order in which you write the alleles of a genotype is very important.
   For a locus with two possible alleles, coded 1 and 2, the data must always
   list them in the same order: 1/1, 1/2, and 2/2 are acceptable. Mixing 1/2
   and 2/1 will result in each of these being treated as a different genotype.
   You should always follow the same order as for previous subjects (i.e.
   don't mix 1/2 and 2/1).

OUTPUT FILE FORMAT
---------------------------------------------------------------------------------
The RPM program generates three files. The naming convention for these three
files is:
1. [data.out] - lists the trait, loci, permutation number, seed value, and
   Rsq value.
2. [data.out].table - lists all the tests with their Rsq, p-value, missing
   samples, and groups.
3. [data.out].table2 - the original input file with added columns representing
   the group assignment for each subject in each of the tests.

[data.out] is the output file name that you gave with the -output option when
you ran RPM.

It is easier to illustrate the output of RPM using the example described above.

[data.out] - After running rpm using the input file in the example above, a
file [data.out] is generated.

The output file of the 2-loci simulation has the following columns:
1. Trait number
2. Locus 1 number
3. Locus 2 number
4. Permutation number
5. Seed value
6. Rsq value

Example of an output file:

    1 1 4 0    1074714623 0.100070
    1 1 4 1    1074714623 0.001000
    . . . .    .          .
    . . . .    .          .
    1 1 4 1000 1074714623 0.000000
    1 2 3 0    1074714623 0.000000
    1 3 4 0    1074714623 0.000000
    1 3 4 0    1074714623 0.000000

The output file of the 3-loci simulation has the following columns:
1. Trait number
2. Locus 1 number
3. Locus 2 number
4. Locus 3 number
5. Permutation number
6. Seed value
7. Rsq value

Example of an output file:

    1 1 2 3 0    1074714623 0.029449
    1 1 2 3 1    1074714623 0.001000
    . . . . .    .          .
    . . . . .    .          .
    1 1 2 3 1000 1074714623 0.000000
    1 2 3 4 0    1074714623 0.000000
    1 2 3 5 0    1074714623 0.000000
    1 2 4 5 0    1074714623 0.000000
    1 3 4 5 0    1074714623 0.030000
    . . . . .    .          .
    . . . . .    .          .
    1 3 4 5 100  1074714623 0.003000

[data.out].table - After running rpm using the input file in the example
above, a file [data.out].table is generated.
The output file [data.out].table for 2 loci is as follows:

    T=1 L1=1 L2=4
    G1: mean=139.734648 sd=121.2345 n=350
    G2: mean=-773.000000 sd=145.2314 n=145
    Missing=5
    1 1 4 0.100070 0.001000
    1 1/2_1/2 15
    1 1/1_1/1 335
    2 1/1_1/2 75
    2 1/2_1/1 60
    2 1/1_2/2 5
    2 2/2_1/1 3
    2 1/2_2/2 2

The file [data.out].table holds the results for all the tests. The example
above is only a partial output, which corresponds to trait 1 and
covariates/genotypes 1 and 4. The rest of the tests are omitted to simplify
the illustration. Tests are separated from each other by a blank line, and
each test has the following information:

Row 1:
    1. Trait number
    2. Locus 1 number
    3. Locus 2 number
Row 2-3: (the number of rows depends on the number of final groups)
    1. Group 1 mean, standard deviation, and number of samples
    2. Group 2 mean, standard deviation, and number of samples
Row 4:
    Samples excluded from the simulations due to missing values
Row 5:
    1. Trait number
    2. First locus number
    3. Second locus number
    4. R square
    5. P value
Row 6 - Last: each row has the following columns:
    1. Group assignment. In the example above there are two groups, labeled
       1 and 2.
    2. Label - the concatenation of the two covariates/genotypes, joined with
       the symbol '_'.
    3. Number of subjects that have these covariates/genotypes

The output file [data.out].table for 3 loci is as follows:

    T=1 L1=1 L2=2 L3=3
    G1: mean=139.734648 sd=145.2345 n=355
    G2: mean=-773.000000 sd=NaN n=1
    G3: mean=-29.317721 sd=125.5325 n=136
    G4: mean=741.150000 sd=NaN n=1
    G5: mean=-431.010000 sd=NaN n=1
    Missing=6
    1 1 2 3 0.029449 Pval=<0.001
    1 1/2_1/2_1/1 14
    1 1/1_1/1_1/2 62
    1 1/1_1/1_1/1 260
    1 1/1_1/1_2/2 6
    1 1/2_1/1_1/2 13
    2 1/1_1/2_2/2 1
    3 1/2_1/1_1/1 55
    3 1/1_1/2_1/1 60
    3 1/1_2/2_1/1 3
    3 1/1_1/2_1/2 14
    3 2/2_1/1_1/2 2
    3 1/2_2/2_1/1 2
    4 1/2_1/2_1/2 1
    5 1/2_1/1_2/2 1

Note: When a group consists of only one sample, sd=NaN indicates that the
standard deviation is undefined for that group.
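
As a quick consistency check on a [data.out].table block, the subject counts in
rows 6 to last should add up to the n reported for each group in the G lines.
The following Python sketch is not part of RPM; it assumes the rows are
whitespace-separated as in the 2-loci example above, and the function name
group_sizes is hypothetical.

```python
from collections import defaultdict

# Rows 6-last of the 2-loci example block, copied from the manual text.
# Assumption: fields are separated by whitespace.
GENOTYPE_ROWS = """\
1 1/2_1/2 15
1 1/1_1/1 335
2 1/1_1/2 75
2 1/2_1/1 60
2 1/1_2/2 5
2 2/2_1/1 3
2 1/2_2/2 2
"""


def group_sizes(rows):
    """Sum the subject counts of every label assigned to each group."""
    sizes = defaultdict(int)
    for line in rows.splitlines():
        if not line.strip():
            continue  # skip blank lines
        group, _label, count = line.split()
        sizes[int(group)] += int(count)
    return dict(sizes)


sizes = group_sizes(GENOTYPE_ROWS)
print(sizes)  # {1: 350, 2: 145}, matching n=350 for G1 and n=145 for G2
```

The same check applied to the 3-loci example gives 355, 1, 136, 1, and 1 for
groups 1 to 5, which again matches the n values on the G lines.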
[data.out].table2 - The second output file that is generated is the original
input file with an added column for each of the tests specified in the
parameter file. Each added column holds the group assignment of the subject
for that specific test. The output file for the example described above (see
the input file) is as follows:

    ID     T1      L1 L2  L3  L4  L5  T1_L1_L4 T1_L2_L3 T1_L3_L4 T1_L3_L5 T1_L4_L5
    100010 -133.39 M  1/1 1/1 1/2 1/1 1        1        1        1        1
    100020 101.97  M  1/1 1/2 1/1 1/1 1        2        1        1        1
    100030 140.78  M  1/2 1/2 1/1 1/1 1        1        1        1        1
    100040 -691.19 M  1/1 1/1 1/2 1/1 1        1        1        1        1
    100050 -508.87 M  1/1 1/1 1/2 1/2 1        1        2        2        1

This is only an illustration, so it includes only the first 5 subjects.

The first line is the header, which describes each of the columns. The naming
convention for the header is as follows:
1. Column 1 is the subject ID.
2. The next column(s), starting at column 2, hold the trait(s). Trait names
   always start with the letter T followed by a number: 1 corresponds to the
   first trait in the data file, 2 corresponds to the second trait, etc.
3. The columns after the traits hold all the covariates/genotypes. Their names
   always start with the letter L followed by a number starting at 1.
   Remember: the covariates and genotypes are grouped together, starting with
   the covariates and immediately followed by the genotypes. They are numbered
   as if they were the same thing, starting with the number 1.
4. The next few columns represent the RPM tests and their results. Each column
   name identifies the trait and the two or three loci used for the test,
   separated by the symbol '_'. For a test with 2 loci (trait 1, locus 1,
   locus 2) the name in the header is T1_L1_L2. For a test with 3 loci
   (trait 1, locus 1, locus 2, locus 3) the name is T1_L1_L2_L3.

The rows after the header hold the information for each subject and the group
assignment of that subject in each of the tests performed.
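
The header naming convention, together with the '*' notation from Section 2 of
the parameter file, can be sketched in Python as follows. This is an
illustration only, not code taken from rpm.c: the helper names
expand_test_line and table2_header are hypothetical, and the order of the test
columns when several traits are analyzed (all tests of the first trait, then
all tests of the next) is an assumption.

```python
from itertools import combinations


def expand_test_line(line, comb=2):
    """Expand one test line of the parameter file into tuples of loci.

    "1 4" names a single combination, while "3*5" stands for every
    comb-sized combination drawn from the inclusive range 3..5.
    """
    line = line.strip()
    if "*" in line:
        first, last = (int(x) for x in line.split("*"))
        return list(combinations(range(first, last + 1), comb))
    return [tuple(int(x) for x in line.split())]


def table2_header(n_traits, n_columns, traits, test_lines, comb=2):
    """Build the .table2 header names (hypothetical helper).

    n_columns is the total number of covariate plus genotype columns.
    Assumption: test columns are grouped by trait, in parameter-file order.
    """
    names = ["ID"]
    names += ["T%d" % t for t in range(1, n_traits + 1)]
    names += ["L%d" % i for i in range(1, n_columns + 1)]
    for trait in traits:
        for line in test_lines:
            for loci in expand_test_line(line, comb):
                names.append("T%d_" % trait + "_".join("L%d" % i for i in loci))
    return names


# The 2-loci sample parameter file applied to the 5-column example data.
header = table2_header(n_traits=1, n_columns=5, traits=[1],
                       test_lines=["1 4", "2 3", "3*5"])
print(" ".join(header))
# ID T1 L1 L2 L3 L4 L5 T1_L1_L4 T1_L2_L3 T1_L3_L4 T1_L3_L5 T1_L4_L5
```

With comb=3, expand_test_line("2*5", comb=3) yields the four combinations
2 3 4, 2 3 5, 2 4 5, and 3 4 5 listed in Section 2.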

ADDITIONAL INFORMATION
---------------------------------------------------------------------------------
You may contact Rob Culverhouse at rob@ilya.wustl.edu for any additional
information. Also, please check our website for updated information and/or new
versions at:
http://ilya.wustl.edu/~pgrn/rpm