30 November 2001 Kevin Karplus

The "stride-ehl" directory contains networks, scripts, and quality
reports for neural nets attempting to predict the secondary structure
defined by STRIDE, reduced to a 3-letter alphabet, from a multiple
alignment.

Most of the scripts are still set up to refer to an older subdirectory
organization (in which the subdirectories of testing/stride-ehl/ were
just subdirectories of testing/).

The following sections are old reports on the quality of the different
networks.  We have moved away from doing EHL predictions, in favor of
the larger alphabet EBGHTL.

The best network so far for EHL2 is
	overrep-2500-IDaa13-7-10-11-10-11-6-5-ehl2-seeded-stride-trained.net
	(3419 parameters)

There are also a few networks here for the EHTL2 alphabet, but not
much work was done with this 4-state alphabet---we jumped to the
6-state EBGHTL alphabet instead.

------------------------------------------------------------
Quality reports

dunbrack-395-IDaa13-9-6-11-9-3-8-7-ehl-seeded* accidentally got
stomped on by running the wrong script.  The final result was about
0.849 bits/column, Q3 about 0.7564, SOV around 0.7243.

As of 1 Jan 2000, the best network is either
	t99-2877-IDaa13-5-8-7-10-5-9-13-seeded-trained
or	t99-2877-IDaa13-5-8-7-10-5-9-11-seeded-trained
Both get just under 0.8 bits/col, 76.95+-0.01% correct,
SOV=0.7335+-0.0003.  The one with the 11-wide window in the last
layer can be trained more quickly to better performance (the machines
were taken down before storing a network with 0.7990 bits/col, 76.94%
correct, SOV=0.7354).

Note: t99-2877-IDaa13-5-8-7-10-5-9-11-seeded2-trained did not make
much improvement, probably because it was trained with too low a
learning rate (from params/31-dec-99) and couldn't bounce out of the
local minimum.
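For reference, the two per-column measures quoted throughout this log
can be sketched as below.  This is a minimal illustration under two
assumptions: Q3 is the fraction of residues whose 3-state (E/H/L)
label is predicted correctly, and bits/column is the mean encoding
cost -log2 P(true state) under the network's output distribution
(consistent with lower-is-better usage in the early entries).  The
function names are hypothetical, not the actual report scripts.

```python
import math

def q3(predicted, true):
    """Fraction of columns with the correct 3-state label (assumed Q3)."""
    assert len(predicted) == len(true)
    return sum(p == t for p, t in zip(predicted, true)) / len(true)

def bits_per_column(probs, true):
    """Mean -log2 of the probability assigned to the true state.

    probs: one dict per column, mapping state letter -> predicted
    probability.  Lower is better (cost of encoding the truth).
    """
    return -sum(math.log2(p[t]) for p, t in zip(probs, true)) / len(true)

# toy example: 7 of 8 residues match
print(q3("HHHLLEEE", "HHLLLEEE"))   # 0.875
```

A network that always output the uniform distribution over E/H/L would
cost log2(3) ~ 1.585 bits/column, so the ~0.8 bits/column figures here
represent a substantial saving over chance.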
6 Feb 2000

The best network is now
	overrep-2260-IDaa13-5-8-7-10-5-9-11-ehl-seeded3-trained.net
and some train/test experiments have been done:

train-on      test-on       Bits    Q3      SOV     object   SOV_E   SOV_H   SOV_C
overrep-2260  overrep-2260  0.7693  0.7781  0.7437  -0.3806  0.7664  0.7560  0.7168
fssp-1929-1   fssp-1929-1   0.7749  0.7766  0.7428  -0.3731  0.7675  0.7474  0.7181
fssp-1929-2   fssp-1929-2   0.7710  0.7772  0.7428  -0.3775  0.7761  0.7538  0.7162
fssp-1929-1   fssp-1929-2   0.7981  0.7700  0.7350  -0.3394  0.7749  0.7460  0.7062
fssp-1929-2   fssp-1929-1   0.8007  0.7689  0.7324  -0.3344  0.7532  0.7295  0.7143

Since fssp-1929-1 has 210699 columns and fssp-1929-2 has 215797
columns, the train/test summary for fssp-1929 is
	0.7994  0.7694  0.7337

20 Feb 2000

Retrained a neural net that lacked the INSERT/DELETE inputs on the
overrep-2260 training set:

train-on      test-on       Bits    Q3      SOV     object   SOV_E   SOV_H   SOV_C
overrep-2260  overrep-2260  0.7779  0.7752  0.7370  -0.3658  0.7692  0.7427  0.7024

Note: the INSERT/DELETE inputs seem to improve the performance by less
than 0.009 bits/column.
(Note: the no-insert-delete network also had clipping on the sequence
weights---perhaps inserts and deletes should be added back to this
network and the first layer heavily retrained.)

I took a version of the no-insert/delete network and tried training it
on single-sequence input:

train-on      test-on       Bits    Q3      SOV     object   SOV_E   SOV_H   SOV_C
overrep-2260  overrep-2260  1.0317  0.6898  0.6433   0.0203  0.6502  0.6583  0.6441

69% correct for single-sequence input is not particularly impressive,
particularly compared with the 77.8% correct when using t99
alignments.  I wonder whether the single-sequence network would
provide a good starting point for retraining with alignment input?

I also wonder whether I can push the 77.8% higher by using t2k
alignments, when I get them finished, as there is some evidence that
the new posterior-decoding alignments are better alignments.
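The fssp-1929 summary row in the 6 Feb table is the column-weighted
average of the two cross-fold rows (train on one half, test on the
other), each row weighted by the number of columns in its test set.  A
quick check with the values copied from the table (it reproduces the
summary to within last-digit rounding):

```python
# Columns per fold, as stated above.
cols = {"fssp-1929-1": 210699, "fssp-1929-2": 215797}

# (train-on, test-on, Bits, Q3, SOV) for the two cross-fold rows.
rows = [
    ("fssp-1929-1", "fssp-1929-2", 0.7981, 0.7700, 0.7350),
    ("fssp-1929-2", "fssp-1929-1", 0.8007, 0.7689, 0.7324),
]

total = sum(cols[test] for _, test, _, _, _ in rows)
bits = sum(cols[test] * b for _, test, b, _, _ in rows) / total
q3   = sum(cols[test] * q for _, test, _, q, _ in rows) / total
sov  = sum(cols[test] * s for _, test, _, _, s in rows) / total
print(f"{bits:.4f} {q3:.4f} {sov:.4f}")
```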
23 Feb 2000

Adding weight clipping to insert/delete networks seems to have gotten
a new best---now I have to see how stringent a clipping works best:

train-on      test-on       Bits    Q3      SOV     object   SOV_E   SOV_H   SOV_C
overrep-2260  overrep-2260  0.7661  0.7788  0.7451  -0.3852  0.7756  0.7552  0.7154

Other checks to make include training on guide vs. t99, and testing on
guide vs. t99.  Here are the tests done on overrep-2260 using
	overrep-2260-guide-aa13-5-8-7-10-5-9-11-ehl-seeded-trained.net
seeded from
	overrep-2260-aa13-5-8-7-10-5-9-11-ehl-seeded2-trained.net

train-on  test-on  Bits    Q3      SOV     object   SOV_E   SOV_H   SOV_C
t99       t99      0.7801  0.7747  0.7374  -0.3634  0.7702  0.7462  0.7006
t99       guide    1.1004  0.6686  0.6205   0.1216  0.6515  0.6161  0.6366
guide     guide    1.0317  0.6898  0.6433   0.0203  0.6502  0.6583  0.6441
guide     t99      0.8377  0.7572  0.7185  -0.2787  0.7248  0.7243  0.6865

Still to do:
	Try adjusting ClipExponent to 0.7 (0.5 is too small).
	Try adjusting bits to save up and down (1.2 and 1.4).
	Try throwing out the third layer of the network, retraining the
		output layer.  --DONE 27 Feb
	Try adding hidden units to the third layer (in progress) and
		the first layer.  --DONE 27 Feb
	Take the best of the "aa13" networks and add insert/delete
		weights from the best IDaa13 network, then retrain.
	Try modifying training so that the sequence of data is not
		scrambled after a new best value, and is only scrambled
		part of the time when the new value is better than the
		previous.  --DONE 28 Feb

25 Feb 2000

Adding an extra hidden unit to the 3rd layer to improve the E/L
distinction did not seem to improve overall performance:
	overrep-2260-aa13-5-8-7-10-5-10-11-ehl-seeded.quality
vs.	overrep-2260-aa13-5-8-7-10-5-9-11-ehl-seeded4.quality

The best current network is
overrep-2260-IDaa13-5-8-7-10-5-9-11-ehl-seeded5-trained.net, which
gets its improvement from setting the ClipExponent down to 0.8.
The best version still doesn't get 78% right, but is getting close:

train-on      test-on       Bits    Q3      SOV     object   SOV_E   SOV_H   SOV_C
overrep-2260  overrep-2260  0.7645  0.7792  0.7461  -0.3878  0.7753  0.7512  0.7194

26 Feb 2000

Further training on overrep-2260-IDaa13-5-8-7-10-5-9-11-ehl-seeded5
gets a slight further improvement:

train-on      test-on       Bits    Q3      SOV     object   SOV_E   SOV_H   SOV_C
overrep-2260  overrep-2260  0.7629  0.7796  0.7455  -0.3894  0.7720  0.7518  0.7199

The quality seems to come much more from the alignments than from the
network, as a three-layer network can be quickly trained to do almost
as well as the heavily trained 4-layer network, even though the number
of weights was dropped to 1851
(overrep-2260-IDaa13-5-8-7-10-13-ehl-seeded):

train-on      test-on       Bits    Q3      SOV     object   SOV_E   SOV_H   SOV_C
overrep-2260  overrep-2260  0.7831  0.7744  0.7374  -0.3600  0.7688  0.7351  0.7165

I'll try adding another hidden unit on the first layer and retraining
the three-layer network more heavily.

28 Feb 2000

Three-layer network with 9 hidden units in the first layer:

train-on      test-on       Bits    Q3      SOV     object   SOV_E   SOV_H   SOV_C
overrep-2260  overrep-2260  0.7806  0.7752  0.7392  -0.3641  0.7675  0.7398  0.7175

3 March 2000

Retraining the best network with ClipExponent changed from 0.8 to 0.7
produced a very slight improvement after long (650-epoch) training:

train-on      test-on       Bits    Q3      SOV     object   SOV_E   SOV_H   SOV_C
overrep-2260  overrep-2260  0.7631  0.7795  0.7462  -0.3895  0.7734  0.7537  0.7187

8 March 2000

Increasing the bits to save to 1.4 and using ClipExponent 0.8 made
tiny improvements (after a lot of training):

train-on      test-on       Bits    Q3      SOV     object   SOV_E   SOV_H   SOV_C
overrep-2260  overrep-2260  0.7627  0.7797  0.7462  -0.3902  0.7748  0.7523  0.7185

Of the 2260 sequences, 1343 "couldn't save exactly 1.4 bits/position"
(some because of the ClipExponent clipping, some just lack of
convergence).  Six sequences had more than 1.41 bits saved, and 1060
had less than 1.39 bits, with some of the singleton sequences having
as low as 0.649 bits.
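The "bits to save" idea behind the 8 March entry can be illustrated as
follows.  This is only a sketch of the principle under loud
assumptions: it supposes the sequence weights of an alignment are
scaled so that the weighted residue distribution saves a target number
of bits/position (relative entropy) over a background, found by
bisection since the saving grows monotonically with total weight.  The
actual AdjustWeights program, its regularizer, and the ClipExponent
clipping are not reproduced here; all names and the uniform background
are invented for the illustration.

```python
import math

BACKGROUND = 0.05  # toy uniform background over 20 amino acids

def alignment_bits_saved(counts, total_weight):
    """Relative entropy (bits/position) to the background of the
    distribution from `counts` scaled to `total_weight`, with a toy
    pseudocount so tiny alignments don't saturate."""
    n = sum(counts)
    pseudo = 1.0
    probs = [(c / n * total_weight + pseudo * BACKGROUND) /
             (total_weight + pseudo) for c in counts]
    return sum(p * math.log2(p / BACKGROUND) for p in probs)

def weight_for_target(counts, target_bits, hi=1000.0):
    """Bisect for the total weight whose distribution saves
    target_bits; if the saving is capped below the target (as for the
    singleton sequences noted above), the search pins at `hi`."""
    lo = 0.0
    for _ in range(100):
        mid = (lo + hi) / 2
        if alignment_bits_saved(counts, mid) < target_bits:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2
```

For a toy single-column alignment where only one residue type is ever
seen, `weight_for_target(counts, 1.4)` finds a finite weight, because
the saving can range from 0 up to the relative entropy of the pure
distribution (log2 20 ~ 4.3 bits against a uniform background).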
3tss was the largest, with
	# 66 sequences, total weight= 20.3291 avg weight= 0.308017
	# AdjustWeights couldn't save exactly 1.4 bits/position, saving 1.59326 bits.
1sbwI was the smallest, with
	# 1 sequences, total weight= 1 avg weight= 1
	# AdjustWeights couldn't save exactly 1.4 bits/position, saving 0.649246 bits.

9 Jan 2001

Updated the quality reports and unit usage.  The quality reports now
have bits_saved as the third column, allowing comparison between
different alphabets.  The objective is now something to be maximized,
rather than minimized.  The unit usage previously had a bug in
reporting E(Phat(i)P(j)) / E(P(j)), which has now been fixed.

It looks like having a richer alphabet makes for more informative
predictions, though the Q measure drops:

alphabet  bits_saved  Q_n     SOV(E)  SOV(H)
EHL2      0.7815      0.7810  0.7337  0.7801
EHTL2     0.8299      0.6842  0.7592  0.7847
EBGHTL    0.9065      0.6667  0.8090  0.8646

It is interesting that splitting L into T and L improves SOV(E) and
SOV(H), though their definitions are unchanged.  The split of E into
EB and of H into GH naturally improves the E and H SOV scores, since B
and G are the hardest to predict.

23 Jan 2001

Using t2k-thin90 alignments, the best EHL2 network is now
	overrep-2500-IDaa13-7-10-11-10-11-6-5-ehl2-seeded-stride-trained.net
	(3419 parameters)
which was built by adding an additional layer to the best EBGHTL
network and retraining:
	overrep-2500-IDaa13-7-10-11-10-11-ebghtl-seeded-stride-trained.net
	(3326 parameters)

alphabet  bits_saved  Q_n     SOV(E)  SOV(H)
EHL2      0.7980      0.7864  0.7331  0.7813
EBGHTL    0.9232      0.6712  0.8075  0.8642

I will add an extra final layer to the EBGHTL network and see if I can
improve the EBGHTL savings some more.
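The reason a bits_saved column permits comparison across alphabets,
while raw encoding cost does not, can be sketched as follows.  This is
a hedged reading, not the report scripts' actual definition: take
bits_saved to be the cost of encoding the true labels under a
background distribution (here, their empirical composition) minus the
cost under the network's predictions.  The background cost already
accounts for the size and skew of the alphabet, so EHL2 and EBGHTL
savings land on a common scale.

```python
import math
from collections import Counter

def prediction_bits_saved(probs, true):
    """Background encoding cost minus the network's encoding cost,
    in bits/column (assumed definition, for illustration only).

    probs: one dict per column, mapping state letter -> predicted
    probability.  The background is the empirical composition of
    `true`, whose mean -log2 cost equals its entropy.
    """
    n = len(true)
    comp = Counter(true)
    background = -sum(c / n * math.log2(c / n) for c in comp.values())
    cost = -sum(math.log2(p[t]) for p, t in zip(probs, true)) / n
    return background - cost
```

A perfect predictor saves exactly the composition entropy of the label
string, which is larger for richer alphabets; that matches the
observation above that EBGHTL yields higher bits_saved even as Q_n
drops.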