Commit 25adc188 authored by dexiong chen

update for revised paper

parent 15306e87
results
scripts
experiments/scripts
......@@ -5,6 +6,7 @@ experiments/test*
logs*
data/*
!README.md
!simulation
.DS_Store
out*
......
This is the PyTorch implementation of CKN-seq for reproducing the results of the paper
>Dexiong Chen, Laurent Jacob, Julien Mairal.
[Predicting Transcription Factor Binding Sites with Convolutional Kernel Networks][4]. preprint bioRxiv. 2017.
[Biological Sequence Modeling with Convolutional Kernel Networks][4]. preprint bioRxiv. 2018.
CKN-seq is free for academic use only
© 2017 All rights reserved - Inria, CNRS, Univ. Grenoble Alpes
© 2018 All rights reserved - Inria, CNRS, Univ. Grenoble Alpes
If you intend to use the software in a non-academic context, please contact julien.mairal@inria.fr
## Small Remark
The implementation provided on this website has a minor difference from the
one used in the paper, but the prediction performance does not seem to be
affected. For the experiments in the paper, we used a small part of the code
of the DeepBind software package to perform DNA sequence shuffling when
generating sequences with negative labels. Since it was not clear to us that
we were allowed to redistribute this piece of code due to licensing issues,
we have implemented a different shuffling algorithm ourselves (sketched below).
If you intend to use the software in a non-academic context, please contact dexiong.chen@inria.fr
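For reference, the idea behind such dinucleotide-preserving ("doublet") shuffling can be sketched as below. This is an illustration only, not the `doublet_shuffle` routine shipped in `dishuffle`:
```python
import random

def naive_doublet_shuffle(seq, n_swaps=10000):
    """Illustrative dinucleotide-preserving shuffle (NOT the packaged algorithm).

    Swapping s[i] and s[j] leaves all dinucleotide counts unchanged whenever
    the two positions are non-adjacent and share the same left and right
    neighbors; the first and last letters never move.
    """
    s = list(seq)
    n = len(s)
    if n < 4:
        return seq
    for _ in range(n_swaps):
        i = random.randrange(1, n - 1)
        j = random.randrange(1, n - 1)
        if abs(i - j) > 1 and s[i - 1] == s[j - 1] and s[i + 1] == s[j + 1]:
            s[i], s[j] = s[j], s[i]
    return ''.join(s)
```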
## Installation
......@@ -28,7 +18,7 @@ We strongly recommend users to use [anaconda][1] to install the following packag
numpy
scipy
scikit-learn=0.19.0
pytorch=0.2.0
pytorch=0.3.0
biopython=1.69
pandas
matplotlib
......@@ -41,40 +31,32 @@ Then we install CKN-seq
pip install .
```
## Reproduction of results for ENCODE ChIP-seq datasets
Download data from [DeepBind website][3] and put encode fold to data/.
Download data from the [DeepBind website][3] and put the encode folder into data/.
#### Single-task model
#### Supervised training without perturbation
* One-layer model
If you want to reproduce the results for Figure 2 in the paper
If you want to reproduce the results for a one-layer CKN-seq model, run
```
cd experiments
python train_encode.py --rc --num-layers 1 --num-motifs 16 --len-motifs 12 --subsamplings 1 --kernel-params 0.3 --epochs 120 --batch-size 128 --ntrials 1 --sup-train --mode all --pooling mean --outdir ../out
python train_encode.py --rc --num-layers 1 --num-motifs 16 --len-motifs 12 --subsamplings 1 --kernel-params 0.3 --epochs 100 --batch-size 128 --ntrials 1 --method sup --pooling mean --outdir ../out
```
You can add `--use-cuda` if you have CUDA installed.
If you want to train and evaluate a model in _top_ mode, just replace `--mode all` with `--mode top`, set a smaller number of epochs such as `--epochs 100`, and possibly a larger number of trials such as `--ntrials 3`.
If you want to use max pooling, just replace `--pooling mean` with `--pooling max` and set a larger sigma, such as `--kernel-params 0.6` and potentially a larger patch size `--len-motifs 16`.
If you want to use max pooling, just replace `--pooling mean` with `--pooling max` and set a larger sigma, such as `--kernel-params 0.6`. This setting was used for training models on the SCOP 1.67 datasets.
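For example, the max-pooling variant of the one-layer command would read as follows (an illustrative invocation that only reuses the flags documented above; check `python train_encode.py --help` for the exact interface):
```
cd experiments
python train_encode.py --rc --num-layers 1 --num-motifs 16 --len-motifs 12 --subsamplings 1 --kernel-params 0.6 --epochs 100 --batch-size 128 --ntrials 1 --method sup --pooling max --outdir ../out
```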
* Two-layer model
```
cd experiments
python train_encode.py --num-layers 2 --num-motifs 64 16 --len-motifs 12 3 --subsamplings 2 1 --kernel-params 0.3 0.6 --epochs 120 --batch-size 128 --ntrials 1 --sup-train --mode all --pooling mean --outdir ../out --rc
python train_encode.py --num-layers 2 --num-motifs 64 16 --len-motifs 12 3 --subsamplings 2 1 --kernel-params 0.3 0.6 --epochs 120 --batch-size 128 --ntrials 1 --method sup --pooling mean --outdir ../out --rc
```
#### Multitask model
```
python train_encode_multitask.py --num-layers 1 --num-motifs 16 --len-motifs 12 --subsamplings 1 --kernel-params 0.3 --epochs 100 --batch-size 128 --ntrials 3 --sup-train --mode top --pooling mean --outdir ../out --rc
```
#### Generation of logos
There are two ways of generating logos
......@@ -83,7 +65,7 @@ There are two ways of generating logos
2. After training a model and saving it to `../out`, you can run
```
python report_encode.py --rc --num-layers 1 --num-motifs 16 --len-motifs 12 --subsamplings 1 --kernel-params 0.3 --epochs 120 --batch-size 128 --ntrials 1 --sup-train --mode all --pooling mean --outdir ../out
python report_encode.py --rc --num-layers 1 --num-motifs 16 --len-motifs 12 --subsamplings 1 --kernel-params 0.3 --epochs 120 --batch-size 128 --ntrials 1 --method sup --pooling mean --outdir ../out
```
#### Unsupervised training
......@@ -91,11 +73,42 @@ There are two ways of generating logos
CKN-seq also supports unsupervised training; however, the number of filters should be much larger to achieve performance similar to supervised training. We give an example of training a one-layer model here:
```
python train_encode.py --rc --num-layers 1 --num-motifs 512 --len-motifs 12 --subsamplings 1 --kernel-params 0.3 --batch-size 200 --ntrials 1 --mode all --pooling mean --outdir ../out
python train_encode.py --rc --num-layers 1 --num-motifs 512 --len-motifs 12 --subsamplings 1 --kernel-params 0.3 --batch-size 200 --ntrials 1 --method unsup --pooling mean --outdir ../out
```
#### Augmented training and hybrid training with perturbation
Supervised CKN-seq can achieve better performance on _small-scale_ datasets by augmenting the training samples with mismatch noise. An example of training with 20% mismatch noise added to the training samples:
```
python train_encode.py --rc --num-layers 1 --num-motifs 16 --len-motifs 12 --subsamplings 1 --kernel-params 0.3 --epochs 100 --batch-size 128 --ntrials 1 --method augmentation --pooling mean --outdir ../out --perturbation 0.2
```
It can be further improved using the hybrid model described in the paper:
```
python train_encode.py --rc --num-layers 1 --num-motifs 16 --len-motifs 12 --subsamplings 1 --kernel-params 0.3 --epochs 100 --batch-size 128 --ntrials 1 --method semisup --pooling mean --outdir ../out --perturbation 0.2 --unsup-dir $PATH_TO_A_TRAINED_UNSUP_MODEL
```
## Reproduction of results for Zeng's motif occupancy datasets
Download data from this [website][5] and put it into data/encode_occupancy/. Then check
```
python train_encode_occupancy.py --help
```
## Reproduction of results for SCOP 1.67 datasets
Download data from the [jlstm website][6] and put it into data/SCOP167-superfamily/.
```
python train_scop.py --help
```
[1]: https://anaconda.org
[2]: http://pytorch.org
[3]: http://tools.genes.toronto.edu/deepbind/nbtcode
[4]: https://www.biorxiv.org/content/early/2017/11/10/217257
\ No newline at end of file
[4]: https://www.biorxiv.org/content/early/2018/10/08/217257
[5]: http://cnn.csail.mit.edu/motif_occupancy
[6]: http://www.bioinf.jku.at/software/LSTM_protein
......@@ -11,6 +11,8 @@ import Bio
import csv
import pandas as pd
from .dishuffle import doublet_shuffle
import torch
from torch.utils.data import Dataset
if sys.version_info < (3,):
import string
......@@ -21,20 +23,20 @@ else:
DNA_TRANS = maketrans('NACGT', '\x00\x01\x02\x03\x04')
DNA_COMP_TRANS = maketrans('ACGT', '\x04\x03\x02\x01')
def load_fasta(filename, alphabet=None):
return SeqIO.parse(filename, 'fasta', alphabet=alphabet)
# def load_fasta(filename, alphabet=None):
# return SeqIO.parse(filename, 'fasta', alphabet=alphabet)
def load_labels(filename):
labels_handler = open(filename, 'rU')
labels = csv.DictReader(labels_handler, delimiter='\t')
strength = labels.fieldnames[-2]
return np.array([float(label[strength]) for label in labels], dtype='float32')
# def load_labels(filename):
# labels_handler = open(filename, 'rU')
# labels = csv.DictReader(labels_handler, delimiter='\t')
# strength = labels.fieldnames[-2]
# return np.array([float(label[strength]) for label in labels], dtype='float32')
def seq_order(s):
l = s.split(':')
chrome = l[0][3:]
start = int(l[1].split('-')[0])
return (chrome, start)
# def seq_order(s):
# l = s.split(':')
# chrome = l[0][3:]
# start = int(l[1].split('-')[0])
# return (chrome, start)
def preprocess(filename, outpath, labelname=None, sort=True, remove_missing=False):
records = load_fasta(filename)
......@@ -101,16 +103,16 @@ def seq2index(seq, alphabet='DNA'):
seq = np.fromstring(seq, dtype='uint8')
return seq.astype('int32')
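# For example, seq2index('ACGT') -> array([1, 2, 3, 4], dtype=int32) with the
# DNA_TRANS table above (the translation step sits in the lines this hunk hides).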
def revcomp(seq, alphabet='DNA'):
"""
reverse complement of the given sequence
"""
if alphabet == 'DNA':
translator = DNA_COMP_TRANS
else:
raise ValueError('not implemented!')
seq = seq.translate(translator)
return seq[::-1]
# def revcomp(seq, alphabet='DNA'):
# """
# reverse complement of the given sequence
# """
# if alphabet == 'DNA':
# translator = DNA_COMP_TRANS
# else:
# raise ValueError('not implemented!')
# seq = seq.translate(translator)
# return seq[::-1]
def pad_sequences(sequences, pre_padding=0, maxlen=None, dtype='int32',
padding='pre', truncating='pre', value=0.):
......@@ -186,6 +188,89 @@ def augment_data(df):
df2 = df_seq2index(df2)
return df2
def augment_noise(df, perturbation=0.1, quantity=10, max_index=4):
def add_noise(seq):
new_seq = seq.copy()
mask = np.random.rand(len(seq)) < perturbation
new_seq[mask] = np.random.randint(1, max_index+1, size=np.count_nonzero(mask), dtype='int32')
return new_seq
df_list = [df]
for i in range(quantity - 1):
new_df = df.copy()
new_df['seq_index'] = df['seq_index'].apply(add_noise)
df_list.append(new_df)
return pd.concat(df_list)
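# Hypothetical usage (names illustrative): given a dataframe whose 'seq_index'
# column holds int32 arrays,
#   df_aug = augment_noise(df, perturbation=0.2, quantity=10)
# returns 10x the rows; each extra copy has ~20% of its positions resampled
# uniformly in {1, ..., max_index}.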
class TensorDataset(Dataset):
def __init__(self, data_tensor, target_tensor, label_mask=None, count_tensor=None, max_index=4):
assert data_tensor.size(0) == target_tensor.size(0)
self.data_tensor = data_tensor
self.target_tensor = target_tensor
self.label_mask = label_mask
self.count_tensor = count_tensor
self.max_index = max_index + 1
def set_mask(self, mask):
self.label_mask = mask
def __getitem__(self, index):
# if self.label_mask is not None:
# return {'data': self.data_tensor[index], 'label': self.target_tensor[index], 'mask': self.label_mask[index]}
if self.count_tensor is not None:
return {'data': self.data_tensor[index], 'label': self.target_tensor[index], 'count': self.count_tensor[index]}
return {'data': self.data_tensor[index], 'label': self.target_tensor[index]}
def __len__(self):
return self.data_tensor.size(0)
def augment(self, perturbation=0.1, quantity=10):
new_tensor = [self.data_tensor]
for i in range(quantity - 1):
t = self.data_tensor.clone()
mask = torch.Tensor(t.size()).uniform_() < perturbation
t[mask] = torch.LongTensor(mask.sum()).random_(1, self.max_index)
new_tensor.append(t)
new_tensor = torch.cat(new_tensor)
self.data_tensor = new_tensor
self.target_tensor = self.target_tensor.repeat(quantity)
class NoisyDataset(Dataset):
def __init__(self, data_tensor, target_tensor, amount=0.1, max_index=4):
assert data_tensor.size(0) == target_tensor.size(0)
self.data_tensor = data_tensor
self.target_tensor = target_tensor
self.amount = amount
self.max_index = max_index + 1
def __getitem__(self, index):
data_tensor = self.data_tensor[index].clone()
noise_mask = torch.ByteTensor([0])
if np.random.rand() < 0.5:
noise_mask = torch.ByteTensor([1])
mask = torch.Tensor(data_tensor.size(0)).uniform_() < self.amount
data_tensor[mask] = torch.LongTensor(mask.sum()).random_(1, self.max_index)
return {'data': data_tensor, 'label': self.target_tensor[index], 'mask': noise_mask}
def __len__(self):
return self.data_tensor.size(0)
# def rolling_window(a, window):
# shape = a.shape[:-1] + (a.shape[-1] - window + 1, window)
# strides = a.strides + (a.strides[-1],)
# return np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)
# def np_obtain_kmer_weights(seq, k, n_char=20):
# """n_char=4 for DNA 20 for protein
# return len=len(seq) - k + 1
# """
# kmers = rolling_window(seq, k)
# bases = np.logspace(k - 1, 0, num=k, base=n_char)
# kmers_idx = np.sum(kmers * bases, axis=1)
# _, indices, counts = np.unique(kmers_idx, return_inverse=True, return_counts=True)
# counts = 1./counts
# return counts[indices]
if __name__ == "__main__":
import sys
......@@ -9,3 +9,5 @@ def exp(x, alpha):
"""
return torch.exp(alpha*(x - 1.))
def res_exp(x, alpha):
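# convex combination of the exponential kernel and the identity map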
return 0.5*exp(x, alpha) + 0.5*x
......@@ -26,7 +26,7 @@ class CKNLayer(nn.Module):
self.stride = stride
self.padding = padding
self.train(training)
self._is_compute_lintrans = True
self._need_computing_lintrans = True
self.patch_dim = self.in_channels * self.filter_size
self.kernel_args_trainable = kernel_args_trainable
......@@ -49,8 +49,8 @@ class CKNLayer(nn.Module):
g_filter = gaussian_filter_1d(2*self.stride - 1).expand(self.out_channels, 1, 2*self.stride-1)
self.g_filter = Variable(g_filter, requires_grad=False)
self.filters = nn.Parameter(torch.Tensor(self.out_channels, self.in_channels, self.filter_size))
# self.lintrans = Variable(torch.Tensor(self.out_channels, self.out_channels), requires_grad=False)
self.lintrans = None
self.lintrans = Variable(torch.Tensor(self.out_channels, self.out_channels), requires_grad=False)
# self.lintrans = None
self.reset_parameters()
indices_filter = torch.zeros(1, 1, self.filter_size)
......@@ -64,24 +64,31 @@ class CKNLayer(nn.Module):
self.filters.data.uniform_(-stdv, stdv)
@property
def is_compute_lintrans(self):
return self._is_compute_lintrans
def need_computing_lintrans(self):
return self._need_computing_lintrans
def train(self, mode=True):
super(CKNLayer, self).train(mode)
self._is_compute_lintrans = True
if self.training is True:
self._need_computing_lintrans = True
def _compute_lintrans(self):
"""Compute the linear transformation factor kappa(ZtZ)^(-1/2)
Returns:
lintrans: out_channels x out_channels
"""
if not self._need_computing_lintrans:
# print("Debug: reuse linear transform")
return self.lintrans
# print("Debug: compute linear transform")
lintrans = self.filters.view(self.out_channels, -1)
lintrans = lintrans.mm(lintrans.t())
lintrans = self.kappa(lintrans)
lintrans = ops.matrix_inverse_sqrt(lintrans, EPSILON)
# self.lintrans_sqrt = ops.matrix_sqrt(lintrans)
# lintrans = torch.inverse(lintrans)
# self.centroids = lintrans
lintrans = ops.matrix_inverse_sqrt(lintrans)
if not self.training:
self._need_computing_lintrans = False
self.lintrans.data = lintrans.data
return lintrans
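# For intuition only (not part of the layer): the same inverse square root can
# be computed by eigendecomposition of the SPD matrix kappa(ZtZ), e.g.
#   e, v = torch.symeig(k, eigenvectors=True)
#   k_inv_sqrt = v.mm(torch.diag((e + EPSILON).rsqrt())).mm(v.t())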
def _conv_layer(self, x_in):
......@@ -101,7 +108,7 @@ class CKNLayer(nn.Module):
x_out = patch_norm * x_out
return x_out
def _mult_layer(self, x_in):
def _mult_layer(self, x_in, lintrans):
"""Multiplication layer
Compute x_out = kappa(ZtZ)^(-1/2) x x_in
Args:
......@@ -109,15 +116,8 @@ class CKNLayer(nn.Module):
self.lintrans: in_channels x in_channels
x_out: batch_size x in_channels x H
"""
if self._is_compute_lintrans:
# print("Debug: compute linear transform")
lintrans = self._compute_lintrans()
if not self.training:
self._is_compute_lintrans = False
self.lintrans = lintrans
else:
# print("Debug: reuse linear transform")
lintrans = self.lintrans
if x_in.dim() == 2:
return torch.mm(x_in, lintrans)
x_out = torch.bmm(lintrans.expand(x_in.size()[:1] + lintrans.size()), x_in)
return x_out
......@@ -131,25 +131,27 @@ class CKNLayer(nn.Module):
stride=self.stride, padding=self.stride-1, groups=self.out_channels)
return x_out
def forward(self, x_in):
def forward(self, x_in, global_pool=None, mask=None):
"""Encode function for a CKN layer
Args:
x_in: batch_size x in_channels x H
"""
x_out = self._conv_layer(x_in)
x_out = self._mult_layer(x_out)
x_out = self._pool_layer(x_out)
lintrans = self._compute_lintrans()
x_out = self._mult_layer(x_out, lintrans)
if global_pool is not None:
x_out = global_pool(x_out, mask=mask)
return x_out.view(x_out.size(0), -1)
return x_out
# def pooling(self, x_in):
# return x_in.mm(self.lintrans_sqrt)
def compute_mask(self, mask=None):
if mask is None:
return mask
mask = mask.float().unsqueeze(1)
mask = F.avg_pool1d(mask, kernel_size=self.filter_size, stride=self.stride)
mask = mask.squeeze() != 0
mask = mask.squeeze(1) != 0
return mask
def compute_indices(self, indices=None):
......@@ -168,10 +170,11 @@ class CKNLayer(nn.Module):
Returns:
patches: (batch_size x (H - filter_size + 1)) x (in_channels x filter_size)
"""
all_patches = x_in.unfold(-1, self.filter_size, 1).transpose(1, 2).contiguous().view(-1, self.patch_dim)
all_patches = x_in.data.unfold(-1, self.filter_size, 1).transpose(1, 2).contiguous().view(-1, self.patch_dim)
if mask is not None:
mask = mask.float().unsqueeze(1)
mask = F.avg_pool1d(mask, kernel_size=self.filter_size, stride=1)
mask = mask.data
mask = mask.view(-1) != 0
all_patches = all_patches[mask.view(-1, 1).expand_as(all_patches)].view(-1, self.patch_dim)
n_sampling_patches = min(all_patches.size(0), n_sampling_patches)
......@@ -190,11 +193,12 @@ class CKNLayer(nn.Module):
filters = spherical_kmeans(patches, self.out_channels)
filters = filters.view_as(self.filters.data)
self.filters.data = filters
self._is_compute_lintrans = True
self._need_computing_lintrans = True
def _apply(self, fn):
super(CKNLayer, self)._apply(fn)
self.one_filters.data = fn(self.one_filters.data)
self.lintrans.data = fn(self.lintrans.data)
self.g_filter.data = fn(self.g_filter.data)
if self.kernel_args is not None and not self.kernel_args_trainable:
for kernel_arg in self.kernel_args:
......@@ -206,11 +210,11 @@ def _reverse(x, dim=-1):
"""
reverse_indices = torch.arange(x.size(dim) - 1, -1, -1)
reverse_indices = reverse_indices.type_as(x.data)
reverse_indices = reverse_indices.long()
reverse_indices = Variable(reverse_indices.long(), requires_grad=False)
return x.index_select(dim=dim, index=reverse_indices)
class BioEmbedding(nn.Module):
def __init__(self, num_embeddings, reverse_complement=False, mask_zeros=False):
def __init__(self, num_embeddings, reverse_complement=False, mask_zeros=False, remove_mean=False):
"""Embedding layer for biosequences
Args:
num_embeddings (int): number of letters in alphabet
......@@ -235,11 +239,11 @@ class BioEmbedding(nn.Module):
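# Row 0 embeds the padding/unknown letter as a uniform distribution over the
# alphabet; rows 1..num_embeddings are one-hot, mirrored here so each letter
# maps to the one-hot vector of its complement.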
weight[0] = 1./self.num_embeddings
weight[1:] = np.fliplr(np.diag(np.ones(self.num_embeddings)))
weight = torch.from_numpy(weight)
# print(weight)
else:
weight = torch.zeros(self.num_embeddings + 1, self.num_embeddings)
weight[0] = 1./self.num_embeddings
weight[1:] = torch.diag(torch.ones(self.num_embeddings))
# weight -= 1./self.num_embeddings
return weight
def compute_mask(self, x):
......@@ -272,11 +276,23 @@ class BioEmbedding(nn.Module):
if self.reverse_complement:
self.weight_rc.data = fn(self.weight_rc.data)
class Normalize(nn.Module):
def __init__(self, mean, scale):
super(Normalize, self).__init__()
self.mean = nn.Parameter(torch.Tensor(mean))
self.scale = nn.Parameter(torch.Tensor(scale))
def forward(self, x):
return (x - self.mean) / self.scale
class Normalize2(nn.Module):
def forward(self, x):
out = x - x.mean(dim=1, keepdim=True)
return out / out.norm(dim=1, keepdim=True)
class GlobalAvg1D(nn.Module):
def __init__(self, in_features):
def __init__(self):
super(GlobalAvg1D, self).__init__()
self.in_features = in_features
self.out_features = in_features
def forward(self, x, mask=None):
if mask is None:
......@@ -286,10 +302,8 @@ class GlobalAvg1D(nn.Module):
return x.sum(dim=-1)/mask.sum(dim=-1)
class GlobalMax1D(nn.Module):
def __init__(self, in_features):
def __init__(self):
super(GlobalMax1D, self).__init__()
self.in_features = in_features
self.out_features = in_features
def forward(self, x, mask=None):
if mask is not None:
......@@ -297,4 +311,3 @@ class GlobalMax1D(nn.Module):
mask = mask.data
x[~mask] = -float("inf")
return x.max(dim=-1)[0]
......@@ -71,7 +71,8 @@ def optimize_motif(motif_idx, ckn_model, max_iter):
filter_c = ckn_model[-1].filters.data[motif_idx:motif_idx+1].clone()
filter_c = Variable(filter_c, volatile=True)
filter_c = ckn_model[-1]._conv_layer(filter_c)
filter_c = ckn_model[-1]._mult_layer(filter_c)
lintrans = ckn_model[-1]._compute_lintrans()
filter_c = ckn_model[-1]._mult_layer(filter_c, lintrans)
filter_c = Variable(filter_c.data, requires_grad=False)
def eval_loss(w):
......@@ -82,7 +83,8 @@ def optimize_motif(motif_idx, ckn_model, max_iter):
proj_on_simplex(motif.data)
out1 = ckn_model(motif.unsqueeze(0), n_layers-1)
out1 = ckn_model[-1]._conv_layer(out1)
out2 = ckn_model[-1]._mult_layer(out1)
lintrans = ckn_model[-1]._compute_lintrans()
out2 = ckn_model[-1]._mult_layer(out1, lintrans)
loss = torch.norm(out2 - filter_c)**2
loss.backward()
return loss.data[0]
......
......@@ -27,53 +27,3 @@ def matrix_inverse_sqrt(input, eps=1e-10):
"""Wrapper for MatrixInverseSqrt"""
eps = Variable(torch.Tensor([eps]).type_as(input.data), requires_grad=False)
return MatrixInverseSqrt()(input, eps)[0]
class MatrixSqrt(torch.autograd.Function):
"""Matrix inverse square root for a symmetric definite positive matrix
"""
def forward(self, input):
e, v = torch.symeig(input, eigenvectors=True)
e_sqrt = torch.sqrt(e)
self.save_for_backward(e_sqrt, v)
output = v.mm(torch.diag(e_sqrt).mm(v.t()))
return output, e_sqrt, v
def backward(self, grad_output, grad_e_sqrt, grad_v):
e_sqrt, v = self.saved_tensors
grad_input = None
ei = e_sqrt.expand_as(v)
ej = e_sqrt.view([-1,1]).expand_as(v)
f = torch.reciprocal(ei+ej)
grad_input = v.mm((f*(v.t().mm(grad_output.mm(v)))).mm(v.t()))
return grad_input
def matrix_sqrt(input):
"""Wrapper for MatrixInverseSqrt"""
return MatrixSqrt()(input)[0]
class MatrixSolve(torch.autograd.Function):
"""Matrix inverse square root for a symmetric definite positive matrix
"""
def forward(self, matrix, rhs):
matrix_lu = torch.btrifact(matrix)
out = rhs.btrisolve(*matrix_lu)
self.save_for_backward(out, *matrix_lu)
return out, matrix_lu[0], matrix_lu[1]
def backward(self, grad_output, lu_data_grad, lu_pivot_grad):
out, matrix_lu, matrix_lu_i = self.saved_tensors
rhs_grad = None
matrix_grad = None
if self.needs_input_grad[1]:
rhs_grad = grad_output.btrisolve(matrix_lu, matrix_lu_i)
if self.needs_input_grad[0]:
if rhs_grad is not None:
matrix_grad = -rhs_grad.bmm(out.transpose(1, 2))
else:
aux = grad_output.btrisolve(matrix_lu, matrix_lu_i)
matrix_grad = -aux.bmm(out.transpose(1, 2))
return matrix_grad, rhs_grad
def matrix_solve(matrix, rhs):
return MatrixSolve()(matrix, rhs)[0]
\ No newline at end of file
import os
from Bio import SeqIO
import numpy as np
import torch
# from torch.utils.data import Dataset
from .data_helper import TensorDataset, NoisyDataset
from sklearn.model_selection import train_test_split
import sys
if sys.version_info < (3,):
import string
maketrans = string.maketrans
else:
maketrans = str.maketrans
def pad_sequences(sequences, pre_padding=0, maxlen=None, dtype='int32',
padding='post', truncating='pre', value=0.):
if not hasattr(sequences, '__len__'):
raise ValueError('`sequences` must be iterable.')
lengths = []
for x in sequences:
if not hasattr(x, '__len__'):
raise ValueError('`sequences` must be a list of iterables. '
'Found non-iterable: ' + str(x))
lengths.append(len(x))
num_samples = len(sequences)
if maxlen is None:
maxlen = np.max(lengths) + 2*pre_padding
# take the sample shape from the first non empty sequence
# checking for consistency in the main loop below.
sample_shape = tuple()
for s in sequences:
if len(s) > 0:
sample_shape = np.asarray(s).shape[1:]
break
x = (np.ones((num_samples, maxlen) + sample_shape) * value).astype(dtype)
for idx, s in enumerate(sequences):
if not len(s):
continue # empty list/array was found
pre_pad = [value]*pre_padding
s = np.hstack([pre_pad, s, pre_pad])
if truncating == 'pre':
trunc = s[-maxlen:]
elif truncating == 'post':
trunc = s[:maxlen]
else:
raise ValueError('Truncating type "%s" not understood' % truncating)
# check `trunc` has expected shape
trunc = np.asarray(trunc, dtype=dtype)
if trunc.shape[1:] != sample_shape:
raise ValueError('Shape of sample %s of sequence at position %s is different from expected shape %s' %
(trunc.shape[1:], idx, sample_shape))
if padding == 'post':
x[idx, :len(trunc)] = trunc
elif padding == 'pre':
x[idx, -len(trunc):] = trunc
else:
raise ValueError('Padding type "%s" not understood' % padding)
return x
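# For example: pad_sequences([[1, 2, 3], [4, 5]], maxlen=4)
# -> array([[1, 2, 3, 0],
#           [4, 5, 0, 0]], dtype=int32)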
class SCOPLoader(object):
def __init__(self, datadir='data/SCOP', ext='fasta', maxlen=None, len_motif=None,
pre_padding=0, swissdir='../data/SCOP/uniprot_sprot.fasta'):
self.datadir = datadir
self.ext = ext
self.filename_tp = datadir + '/{}-{}.{}.' + ext
self.dict = ['A', 'R', 'N', 'D', 'C', 'Q', 'E', 'G', 'H', 'I', 'L', 'K', 'M', 'F',
'P', 'S', 'T', 'W', 'Y', 'V']
self.ambiguous = 'XBZJUO'
self.protein_trans = maketrans(self.ambiguous + ''.join(self.dict),
'\x00'*len(self.ambiguous) + '\x01\x02\x03\x04\x05\x06\x07\x08\t\n\x0b\x0c\r\x0e\x0f\x10\x11\x12\x13\x14')
self.maxlen = maxlen
self.pre_padding = pre_padding
self.swissdir = swissdir
if len_motif is None:
self.len_motif = pre_padding + 1
else:
self.len_motif = len_motif
def get_ids(self, indices=None):
names = sorted([".".join(filename.split('.')[1:-1])
for filename in os.listdir(self.datadir)
if filename.startswith('pos-train')])
if indices is not None and indices != []:
names = [names[index] for index in indices if index < len(names)]
return names
def get_nb_fasta(self, filename):
records = SeqIO.parse(filename, self.ext)
return sum([1 for r in records])
def get_nb(self, dataid, split='train'):
pos = self.filename_tp.format('pos', split, dataid)
neg = self.filename_tp.format('neg', split, dataid)
return self.get_nb_fasta(pos), self.get_nb_fasta(neg)
def seq2index(self, seq):
seq = seq.translate(self.protein_trans)
seq = np.fromstring(seq, dtype='uint8').astype('int64')
return seq
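# e.g. seq2index('ARX') -> array([1, 2, 0]): the 20 standard residues map to
# 1..20 following self.dict, ambiguous letters ('XBZJUO') map to 0.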
def read_fasta(self, filename):
records = SeqIO.parse(filename, self.ext)
sequences = []
names = []
for record in records:
seq = str(record.seq).upper()
name = record.id
sequences.append(self.seq2index(seq))
names.append(name)
# sequences = pad_sequences(sequences)
return sequences, names
def get_swissprot(self, torch_tensor=True, max_size=100000):