30 November 2001 Kevin Karplus

The "stride-ehl" directory contains networks, scripts, and quality
reports for neural nets attempting to predict the secondary structure
defined by STRIDE, reduced to a 3-letter alphabet, from a multiple
alignment.

Most of the scripts are still set up to refer to an older subdirectory
organization (in which the subdirectories of testing/stride-ehl/ were
just subdirectories of testing/).

The following sections are old reports on the quality of the different
networks.  We have moved away from doing EHL predictions, in favor of
the larger alphabet EBGHTL.

The best network so far for EHL2 is
	overrep-2500-IDaa13-7-10-11-10-11-6-5-ehl2-seeded-stride-trained.net
	(3419 parameters)

There are also a few networks here for the EHTL2 alphabet, but not
much work was done with this 4-state alphabet---we jumped to the
6-state EBGHTL alphabet instead.

------------------------------------------------------------
Quality reports

dunbrack-395-IDaa13-9-6-11-9-3-8-7-ehl-seeded* accidentally got
stomped on by running the wrong script.  The final result was about
0.849 bits/column, Q3 about 0.7564, SOV around 0.7243.

As of 1 Jan 2000, the best network is either
	t99-2877-IDaa13-5-8-7-10-5-9-13-seeded-trained
or	t99-2877-IDaa13-5-8-7-10-5-9-11-seeded-trained
Both get just under 0.8 bits/col, 76.95+-0.01% correct,
SOV=0.7335+-0.0003.  The one with the 11-wide window in the last
layer can be trained more quickly to better performance (the machines
were taken down before storing a network with 0.7990 bits/col, 76.94%
correct, SOV=0.7354).

Note: t99-2877-IDaa13-5-8-7-10-5-9-11-seeded2-trained did not make
much improvement, probably because it was trained with too low a
learning rate (from params/31-dec-99) and couldn't bounce out of the
local minimum.
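For reference, the two per-column measures quoted throughout this log
can be sketched as below.  This is a minimal illustration under two
assumptions: Q3 is the fraction of residues whose 3-state (E/H/L)
label is predicted correctly, and bits/column is the mean encoding
cost -log2 P(true state) under the network's output distribution
(consistent with lower-is-better usage in the early entries).  The
function names are hypothetical, not the actual report scripts.

```python
import math

def q3(predicted, true):
    """Fraction of columns with the correct 3-state label (assumed Q3)."""
    assert len(predicted) == len(true)
    return sum(p == t for p, t in zip(predicted, true)) / len(true)

def bits_per_column(probs, true):
    """Mean -log2 of the probability assigned to the true state.

    probs: one dict per column, mapping state letter -> predicted
    probability.  Lower is better (cost of encoding the truth).
    """
    return -sum(math.log2(p[t]) for p, t in zip(probs, true)) / len(true)

# toy example: 7 of 8 residues match
print(q3("HHHLLEEE", "HHLLLEEE"))   # 0.875
```

A network that always output the uniform distribution over E/H/L would
cost log2(3) ~ 1.585 bits/column, so the ~0.8 bits/column figures here
represent a substantial saving over chance.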
6 Feb 2000

The best network is now
	overrep-2260-IDaa13-5-8-7-10-5-9-11-ehl-seeded3-trained.net
and some train/test experiments have been done:

train-on      test-on       Bits    Q3      SOV     object   SOV_E   SOV_H   SOV_C
overrep-2260  overrep-2260  0.7693  0.7781  0.7437  -0.3806  0.7664  0.7560  0.7168
fssp-1929-1   fssp-1929-1   0.7749  0.7766  0.7428  -0.3731  0.7675  0.7474  0.7181
fssp-1929-2   fssp-1929-2   0.7710  0.7772  0.7428  -0.3775  0.7761  0.7538  0.7162
fssp-1929-1   fssp-1929-2   0.7981  0.7700  0.7350  -0.3394  0.7749  0.7460  0.7062
fssp-1929-2   fssp-1929-1   0.8007  0.7689  0.7324  -0.3344  0.7532  0.7295  0.7143

Since fssp-1929-1 has 210699 columns and fssp-1929-2 has 215797
columns, the train/test summary for fssp-1929 is
	0.7994  0.7694  0.7337

20 Feb 2000

Retrained a neural net that lacked the INSERT/DELETE inputs on the
overrep-2260 training set:

train-on      test-on       Bits    Q3      SOV     object   SOV_E   SOV_H   SOV_C
overrep-2260  overrep-2260  0.7779  0.7752  0.7370  -0.3658  0.7692  0.7427  0.7024

Note: the INSERT/DELETE inputs seem to improve the performance by less
than 0.009 bits/column.
(Note: the no-insert-delete network also had clipping on the sequence
weights---perhaps inserts and deletes should be added back to this
network and the first layer heavily retrained.)

I took a version of the no-insert/delete network and tried training it
on single-sequence input:

train-on      test-on       Bits    Q3      SOV     object   SOV_E   SOV_H   SOV_C
overrep-2260  overrep-2260  1.0317  0.6898  0.6433   0.0203  0.6502  0.6583  0.6441

69% correct for single-sequence input is not particularly impressive,
particularly compared with the 77.8% correct when using t99
alignments.  I wonder whether the single-sequence network would
provide a good starting point for retraining with alignment input?

I also wonder whether I can push the 77.8% higher by using t2k
alignments, when I get them finished, as there is some evidence that
the new posterior-decoding alignments are better alignments.
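The fssp-1929 summary row in the 6 Feb table is the column-weighted
average of the two cross-fold rows (train on one half, test on the
other), each row weighted by the number of columns in its test set.  A
quick check with the values copied from the table (it reproduces the
summary to within last-digit rounding):

```python
# Columns per fold, as stated above.
cols = {"fssp-1929-1": 210699, "fssp-1929-2": 215797}

# (train-on, test-on, Bits, Q3, SOV) for the two cross-fold rows.
rows = [
    ("fssp-1929-1", "fssp-1929-2", 0.7981, 0.7700, 0.7350),
    ("fssp-1929-2", "fssp-1929-1", 0.8007, 0.7689, 0.7324),
]

total = sum(cols[test] for _, test, _, _, _ in rows)
bits = sum(cols[test] * b for _, test, b, _, _ in rows) / total
q3   = sum(cols[test] * q for _, test, _, q, _ in rows) / total
sov  = sum(cols[test] * s for _, test, _, _, s in rows) / total
print(f"{bits:.4f} {q3:.4f} {sov:.4f}")
```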
23 Feb 2000

Adding weight clipping to insert/delete networks seems to have gotten
a new best---now I have to see how stringent a clipping works best:

train-on      test-on       Bits    Q3      SOV     object   SOV_E   SOV_H   SOV_C
overrep-2260  overrep-2260  0.7661  0.7788  0.7451  -0.3852  0.7756  0.7552  0.7154

Other checks to make include training on guide vs. t99, and testing on
guide vs. t99.  Here are the tests done on overrep-2260 using
	overrep-2260-guide-aa13-5-8-7-10-5-9-11-ehl-seeded-trained.net
seeded from
	overrep-2260-aa13-5-8-7-10-5-9-11-ehl-seeded2-trained.net

train-on  test-on  Bits    Q3      SOV     object   SOV_E   SOV_H   SOV_C
t99       t99      0.7801  0.7747  0.7374  -0.3634  0.7702  0.7462  0.7006
t99       guide    1.1004  0.6686  0.6205   0.1216  0.6515  0.6161  0.6366
guide     guide    1.0317  0.6898  0.6433   0.0203  0.6502  0.6583  0.6441
guide     t99      0.8377  0.7572  0.7185  -0.2787  0.7248  0.7243  0.6865

Still to do:
	Try adjusting ClipExponent to 0.7 (0.5 is too small).
	Try adjusting bits to save up and down (1.2 and 1.4).
	Try throwing out the third layer of the network, retraining the
		output layer.  --DONE 27 Feb
	Try adding hidden units to the third layer (in progress) and
		the first layer.  --DONE 27 Feb
	Take the best of the "aa13" networks and add insert/delete
		weights from the best IDaa13 network, then retrain.
	Try modifying training so that the sequence of data is not
		scrambled after a new best value, and is only scrambled
		part of the time when the new value is better than the
		previous.  --DONE 28 Feb

25 Feb 2000

Adding an extra hidden unit to the 3rd layer to improve the E/L
distinction did not seem to improve overall performance:
	overrep-2260-aa13-5-8-7-10-5-10-11-ehl-seeded.quality
vs.	overrep-2260-aa13-5-8-7-10-5-9-11-ehl-seeded4.quality

The best current network is
overrep-2260-IDaa13-5-8-7-10-5-9-11-ehl-seeded5-trained.net, which
gets its improvement from setting the ClipExponent down to 0.8.
The best version still doesn't get 78% right, but is getting close:

train-on      test-on       Bits    Q3      SOV     object   SOV_E   SOV_H   SOV_C
overrep-2260  overrep-2260  0.7645  0.7792  0.7461  -0.3878  0.7753  0.7512  0.7194

26 Feb 2000

Further training on overrep-2260-IDaa13-5-8-7-10-5-9-11-ehl-seeded5
gets a slight further improvement:

train-on      test-on       Bits    Q3      SOV     object   SOV_E   SOV_H   SOV_C
overrep-2260  overrep-2260  0.7629  0.7796  0.7455  -0.3894  0.7720  0.7518  0.7199

The quality seems to come much more from the alignments than from the
network, as a three-layer network can be quickly trained to do almost
as well as the heavily trained 4-layer network, even though the number
of weights was dropped to 1851
(overrep-2260-IDaa13-5-8-7-10-13-ehl-seeded):

train-on      test-on       Bits    Q3      SOV     object   SOV_E   SOV_H   SOV_C
overrep-2260  overrep-2260  0.7831  0.7744  0.7374  -0.3600  0.7688  0.7351  0.7165

I'll try adding another hidden unit on the first layer and retraining
the three-layer network more heavily.

28 Feb 2000

Three-layer network with 9 hidden units in the first layer:

train-on      test-on       Bits    Q3      SOV     object   SOV_E   SOV_H   SOV_C
overrep-2260  overrep-2260  0.7806  0.7752  0.7392  -0.3641  0.7675  0.7398  0.7175

3 March 2000

Retraining the best network with ClipExponent changed from 0.8 to 0.7
produced a very slight improvement after long (650-epoch) training:

train-on      test-on       Bits    Q3      SOV     object   SOV_E   SOV_H   SOV_C
overrep-2260  overrep-2260  0.7631  0.7795  0.7462  -0.3895  0.7734  0.7537  0.7187

8 March 2000

Increasing the bits to save to 1.4 and using ClipExponent 0.8 made
tiny improvements (after a lot of training):

train-on      test-on       Bits    Q3      SOV     object   SOV_E   SOV_H   SOV_C
overrep-2260  overrep-2260  0.7627  0.7797  0.7462  -0.3902  0.7748  0.7523  0.7185

Of the 2260 sequences, 1343 "couldn't save exactly 1.4 bits/position"
(some because of the ClipExponent clipping, some just lack of
convergence).  Six sequences had more than 1.41 bits saved, and 1060
had less than 1.39 bits, with some of the singleton sequences having
as low as 0.649 bits.
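The "bits to save" idea behind the 8 March entry can be illustrated as
follows.  This is only a sketch of the principle under loud
assumptions: it supposes the sequence weights of an alignment are
scaled so that the weighted residue distribution saves a target number
of bits/position (relative entropy) over a background, found by
bisection since the saving grows monotonically with total weight.  The
actual AdjustWeights program, its regularizer, and the ClipExponent
clipping are not reproduced here; all names and the uniform background
are invented for the illustration.

```python
import math

BACKGROUND = 0.05  # toy uniform background over 20 amino acids

def alignment_bits_saved(counts, total_weight):
    """Relative entropy (bits/position) to the background of the
    distribution from `counts` scaled to `total_weight`, with a toy
    pseudocount so tiny alignments don't saturate."""
    n = sum(counts)
    pseudo = 1.0
    probs = [(c / n * total_weight + pseudo * BACKGROUND) /
             (total_weight + pseudo) for c in counts]
    return sum(p * math.log2(p / BACKGROUND) for p in probs)

def weight_for_target(counts, target_bits, hi=1000.0):
    """Bisect for the total weight whose distribution saves
    target_bits; if the saving is capped below the target (as for the
    singleton sequences noted above), the search pins at `hi`."""
    lo = 0.0
    for _ in range(100):
        mid = (lo + hi) / 2
        if alignment_bits_saved(counts, mid) < target_bits:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2
```

For a toy single-column alignment where only one residue type is ever
seen, `weight_for_target(counts, 1.4)` finds a finite weight, because
the saving can range from 0 up to the relative entropy of the pure
distribution (log2 20 ~ 4.3 bits against a uniform background).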
3tss was the largest, with
	# 66 sequences, total weight= 20.3291 avg weight= 0.308017
	# AdjustWeights couldn't save exactly 1.4 bits/position, saving 1.59326 bits.
1sbwI was the smallest, with
	# 1 sequences, total weight= 1 avg weight= 1
	# AdjustWeights couldn't save exactly 1.4 bits/position, saving 0.649246 bits.

9 Jan 2001

Updated the quality reports and unit usage.  The quality reports now
have bits_saved as the third column, allowing comparison between
different alphabets.  The objective is now something to be maximized,
rather than minimized.  The unit usage previously had a bug in
reporting E(Phat(i)P(j)) / E(P(j)), which has now been fixed.

It looks like having a richer alphabet makes for more informative
predictions, though the Q measure drops:

alphabet  bits_saved  Q_n     SOV(E)  SOV(H)
EHL2      0.7815      0.7810  0.7337  0.7801
EHTL2     0.8299      0.6842  0.7592  0.7847
EBGHTL    0.9065      0.6667  0.8090  0.8646

It is interesting that splitting L into T and L improves SOV(E) and
SOV(H), though their definitions are unchanged.  The split of E into
EB and of H into GH naturally improves the E and H SOV scores, since B
and G are the hardest to predict.

23 Jan 2001

Using t2k-thin90 alignments, the best EHL2 network is now
	overrep-2500-IDaa13-7-10-11-10-11-6-5-ehl2-seeded-stride-trained.net
	(3419 parameters)
which was built by adding an additional layer to the best EBGHTL
network and retraining:
	overrep-2500-IDaa13-7-10-11-10-11-ebghtl-seeded-stride-trained.net
	(3326 parameters)

alphabet  bits_saved  Q_n     SOV(E)  SOV(H)
EHL2      0.7980      0.7864  0.7331  0.7813
EBGHTL    0.9232      0.6712  0.8075  0.8642

I will add an extra final layer to the EBGHTL network and see if I can
improve the EBGHTL savings some more.
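The reason a bits_saved column permits comparison across alphabets,
while raw encoding cost does not, can be sketched as follows.  This is
a hedged reading, not the report scripts' actual definition: take
bits_saved to be the cost of encoding the true labels under a
background distribution (here, their empirical composition) minus the
cost under the network's predictions.  The background cost already
accounts for the size and skew of the alphabet, so EHL2 and EBGHTL
savings land on a common scale.

```python
import math
from collections import Counter

def prediction_bits_saved(probs, true):
    """Background encoding cost minus the network's encoding cost,
    in bits/column (assumed definition, for illustration only).

    probs: one dict per column, mapping state letter -> predicted
    probability.  The background is the empirical composition of
    `true`, whose mean -log2 cost equals its entropy.
    """
    n = len(true)
    comp = Counter(true)
    background = -sum(c / n * math.log2(c / n) for c in comp.values())
    cost = -sum(math.log2(p[t]) for p, t in zip(probs, true)) / n
    return background - cost
```

A perfect predictor saves exactly the composition entropy of the label
string, which is larger for richer alphabets; that matches the
observation above that EBGHTL yields higher bits_saved even as Q_n
drops.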