README of artificial data sets "LJ-r.f.s"

Naming convention for LJ-r.f.s data files:
  -- LJ stands for the Lennard-Jones potential, which is used as the
     basis for mimicking intermolecular interactions when generating
     the artificial data sets.
  -- r is the number of relevant features.
  -- f is the total number of features (dimensions).
  -- s is the number of different scale factors used for the relevant
     features.

Naming convention for supplemental files
(each entry is the suffix of the corresponding file):
  *.best:       the feature values of the target concept;
  *.scale:      the values of the scale factors for the features;
  *.sigma:      the sigma value on each feature (useful for data generation);
  *.structRcpt: the "structure" of the target concept;
  *.labelMol:   the labels of all the instances in each molecule.

(I) data set: "LJ-r.166.s/"

Notes:
(1) This data set is generated on the basis of the Lennard-Jones
    potential for intermolecular interactions. It is used to mimic
    the "Musk1" data.
(2) # training examples per data set = 92;
    # positive examples = 47;
    # negative examples = 45;
    # instances per bag = 3 to 5;
    # attributes (features) = 166;
    Label_max = 1.0;
    an instance with label 1.0 has only the values of its relevant
    features equal to those of the target concept;

Data files:
---------------------------------------------------------------
(1) filename: LJ-160.166.1-S
    There are 92 molecules:
      the first 47 have labels > 0.6 (mimicking the + molecules in MUSK1);
      the next 45 have labels < 0.3 (mimicking the - molecules in MUSK1).

    for (int i = 0; i < 160; i++)
        rcpt->s_factor[i] = 0.9 + 0.1*rand();
    for (int i = 160; i < 165; i++)
        rcpt->s_factor[i] = 0.0;
----------------------------------------------------------------
(2) filename: LJ-160.166.1
    There are 92 molecules:
      the first 47 have labels >= 0.5 (mimicking the + molecules in MUSK1);
      the next 45 have labels < 0.5 (mimicking the - molecules in MUSK1).
    The molecule labels are more uniformly distributed over [0,1].

    rcpt->s_factor[i] is the same as that of LJ-160.166.1-S.
----------------------------------------------------------------
(3) filename: LJ-80.166.1-S
    There are 92 molecules:
      the first 47 have labels > 0.6 (mimicking the + molecules in MUSK1);
      the next 45 have labels < 0.3 (mimicking the - molecules in MUSK1).

    for (int i = 0; i < 80; i++)
        rcpt->s_factor[i] = 0.9 + 0.1*rand();
    for (int i = 80; i < 165; i++)
        rcpt->s_factor[i] = 0.0;
----------------------------------------------------------------
(4) filename: LJ-80.166.1
    There are 92 molecules:
      the first 47 have labels >= 0.5 (mimicking the + molecules in MUSK1);
      the next 45 have labels < 0.5 (mimicking the - molecules in MUSK1).

    rcpt->s_factor[i] is the same as that of LJ-80.166.1-S.
----------------------------------------------------------------

###########################################################################
(II) data set: "LJ-r.283.s/"

(1) This data set is generated on the basis of the Lennard-Jones
    potential for intermolecular interactions. It is used to mimic
    the "Affinity" data provided by CombiChem.
(2) # training examples per data set = 200;
    # instances per bag = 3 to 5;
    # attributes (features) = 283;
    Label_max = 1.0;
    an instance with label 1.0 has only the values of its relevant
    features equal to those of the target concept;
(3) Different numbers of relevancy levels were applied to different
    data sets to test the effect of the levels of relevancy on the
    multiple-instance learning algorithm [see datasets (1) to (4)
    below].
    To test, with the levels of relevancy fixed, the effect of the
    number (or ratio) of irrelevant to relevant features on the
    performance of the learning algorithm, datasets (5) to (11) were
    generated.
---------------------------------------------------------------
i: index of feature vectors;
For datasets (1) to (4):
    for (int i = 150; i < 283; i++)
        rcpt->s_factor[i] = 0.0;
---------------------------------------------------------------
(1) filename: LJ-150.283.2
    case 1: // 2 levels of relevancy
        for (int i = 0; i < 50; i++)
            rcpt->s_factor[i] = 1.0;
        for (int i = 50; i < 150; i++)
            rcpt->s_factor[i] = 0.5;
----------------------------------------------------------------
(2) filename: LJ-150.283.4
    case 2: // 4 levels of relevancy
        for (int i = 0; i < 40; i++)
            rcpt->s_factor[i] = 1.0;
        for (int i = 40; i < 80; i++)
            rcpt->s_factor[i] = 0.75;
        for (int i = 80; i < 120; i++)
            rcpt->s_factor[i] = 0.5;
        for (int i = 120; i < 150; i++)
            rcpt->s_factor[i] = 0.25;
----------------------------------------------------------------
(3) filename: LJ-150.283.10
    case 3: // 10 levels of relevancy
        for (int i = 0; i < 10; i++)
            for (int j = 0; j < 15; j++)
                rcpt->s_factor[i*15 + j] = 1.0 - (float)(i*0.1);
----------------------------------------------------------------
(4) filename: LJ-150.283.15
    case 4: // 15 levels of relevancy
        for (int i = 0; i < 15; i++)
            for (int j = 0; j < 10; j++)
                rcpt->s_factor[i*10 + j] = 1.0 - (float)(i)/15.0;
----------------------------------------------------------------
----------------------------------------------------------------
For datasets (5)-(11):
    There are 4 levels of relevancy (with sf = 1.0, 0.75, 0.5, 0.25)
    plus irrelevant features (with sf = 0.0).
    The relevant features are divided evenly among the 4 levels.
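The four-level scheme described above for datasets (5)-(11) might be
sketched as follows. This is only an illustration, not the original
generator: it assumes the levels occupy contiguous blocks in decreasing
order of scale factor (the ordering used in LJ-150.283.4) and that the
number of relevant features is a multiple of 4. The helper name
assign_four_levels and its parameters are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not from the original generator).
 * Fills s_factor[0 .. n_features-1]:
 *   - the first n_relevant entries are split evenly into four contiguous
 *     blocks with scale factors 1.0, 0.75, 0.5 and 0.25 (assumed ordering);
 *   - the remaining entries are irrelevant features with scale factor 0.0. */
static void assign_four_levels(double *s_factor, int n_features, int n_relevant)
{
    static const double levels[4] = {1.0, 0.75, 0.5, 0.25};

    assert(s_factor != NULL);
    /* Assumption: n_relevant is a positive multiple of 4 and fits in the array. */
    assert(n_relevant >= 4 && n_relevant % 4 == 0 && n_relevant <= n_features);

    int per_level = n_relevant / 4;   /* features per relevancy level */
    for (int i = 0; i < n_features; i++)
        s_factor[i] = (i < n_relevant) ? levels[i / per_level] : 0.0;
}
```

Under these assumptions, calling assign_four_levels(sf, 283, 40) on a
283-element array gives 10 features at each of the four levels
(indices 0-39) and sf = 0.0 for indices 40-282.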
----------------------------------------------------------------
(5)  filename: LJ-40.283.4   (with 40 relevant features out of 283)
(6)  filename: LJ-80.283.4   (with 80 relevant features out of 283)
(7)  filename: LJ-120.283.4  (with 120 relevant features out of 283)
(8)  filename: LJ-160.283.4  (with 160 relevant features out of 283)
(9)  filename: LJ-200.283.4  (with 200 relevant features out of 283)
(10) filename: LJ-240.283.4  (with 240 relevant features out of 283)
(11) filename: LJ-280.283.4  (with 280 relevant features out of 283)
-------------------------------------------------------------------------

############################################################################
(III) data set: "LJ-r.30.s/"

(1) This data set is generated on the basis of the Lennard-Jones
    potential for intermolecular interactions. It is composed of
    examples with fewer features, to reduce the computational cost
    while still being able to mimic the behavior of the "Affinity"
    data provided by CombiChem.
(2) # training examples per data set = 60;
    # instances per bag = 3 to 5;
    # attributes (features) = 30;
----------------------------------------------------------------------
i: index of feature vectors;
    for (int i = 0; i < 8; i++)
        rcpt->s_factor[i] = 1.0;
    for (int i = 8; i < 16; i++)
        rcpt->s_factor[i] = 0.5;
    for (int i = 16; i < 30; i++)
        rcpt->s_factor[i] = 0.0;
---------------------------------------------------------------
------------------------------------------------------------------------
(1) filename: LJ-16.30.2-T
    Label_max = 1.0;
    An instance with label 1.0 has all feature values equal to those
    of the target concept.
------------------------------------------------------------------------
(2) filename: LJ-16.30.2
    Label_max = 1.0;
    An instance with label 1.0 has only its relevant feature values
    equal to those of the target concept.
------------------------------------------------------------------------
(3) filename: LJ-16.30.2-0.9
    Label_max = 0.9;
------------------------------------------------------------------------
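The LJ-r.f.s naming convention can be decoded programmatically. The
following is a minimal sketch, not part of the original tools: the
struct lj_name and the function parse_lj_name are hypothetical names.
Some file names above carry an extra trailing tag ("-S", "-T", "-0.9");
the sketch stores that tag as an opaque string and does not interpret
it. It expects the bare data file name, without a directory or a
supplemental-file suffix.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical record for the fields encoded in a data file name. */
struct lj_name {
    int  relevant;   /* r: number of relevant features          */
    int  features;   /* f: total number of features             */
    int  scales;     /* s: number of different scale factors    */
    char tag[16];    /* optional trailing tag, "" if none       */
};

/* Parses names of the form LJ-r.f.s or LJ-r.f.s-tag.
 * Returns 1 if r, f and s were all read, 0 otherwise.
 * This is a lenient sketch: it does not reject trailing characters
 * that follow s without a '-' separator. */
static int parse_lj_name(const char *name, struct lj_name *out)
{
    assert(name != NULL && out != NULL);

    out->tag[0] = '\0';
    /* %15s leaves room for the terminating NUL in tag[16]. */
    int n = sscanf(name, "LJ-%d.%d.%d-%15s",
                   &out->relevant, &out->features, &out->scales, out->tag);
    return n >= 3;
}
```

For example, parsing "LJ-16.30.2-0.9" yields r = 16, f = 30, s = 2 and
the tag "0.9", while "LJ-160.166.1" yields r = 160, f = 166, s = 1 and
an empty tag.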