Commit e8448e01 authored by Dexiong Chen

update CKN-seq

parent 25adc188
......@@ -3,6 +3,8 @@ This is the Pytorch implementation of CKN-seq for reproducing the results of the
>Dexiong Chen, Laurent Jacob, Julien Mairal.
[Biological Sequence Modeling with Convolutional Kernel Networks][4]. bioRxiv preprint, 2018.
#### If you want to reproduce the results in the paper, please use the branch `recomb2019`.
CKN-seq is free for academic use only.
© 2018 All rights reserved - Inria, CNRS, Univ. Grenoble Alpes
......@@ -18,7 +20,7 @@ We strongly recommend users to use [anaconda][1] to install the following packag
numpy
scipy
scikit-learn=0.19.0
pytorch=0.3.0
pytorch=0.4.1
biopython=1.69
pandas
matplotlib
......@@ -31,41 +33,41 @@ Then we install CKN-seq
pip install .
```
## Reproduction of results for ENCODE ChIP-seq datasets.
or simply run
```
export PYTHONPATH=$PWD:$PYTHONPATH
```
## Training model on ENCODE ChIP-seq and SCOP 1.67 datasets.
Download data from the [DeepBind website][3] and put the encode folder in data/.
Download data from the [DeepBind website][3] and put the encode folder in data/. Download data from the [jlstm website][6] and put it in data/SCOP167-superfamily/.
#### Supervised training without perturbation
* One-layer model
If you want to reproduce the results for a one-layer CKN-seq
If you want to train a one-layer CKN-seq, run
```
cd experiments
python train_encode.py --rc --num-layers 1 --num-motifs 16 --len-motifs 12 --subsamplings 1 --kernel-params 0.3 --epochs 100 --batch-size 128 --ntrials 1 --method sup --pooling mean --outdir ../out
python train_encode.py --num-motifs 128 --method sup --outdir ../out
```
You can add `--use-cuda` if you have CUDA installed.
If you want to use max pooling, just replace `--pooling mean` with `--pooling max` and set a larger sigma, such as `--kernel-params 0.6`. This setting was used for training models on the SCOP 1.67 datasets.
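For instance, the one-layer command above with only the pooling and sigma flags changed (an illustrative command, not one used in the paper):
```
python train_encode.py --rc --num-layers 1 --num-motifs 16 --len-motifs 12 --subsamplings 1 --kernel-params 0.6 --epochs 100 --batch-size 128 --ntrials 1 --method sup --pooling max --outdir ../out
```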
* Two-layer model
on SCOP 1.67
```
cd experiments
python train_encode.py --num-layers 2 --num-motifs 64 16 --len-motifs 12 3 --subsamplings 2 1 --kernel-params 0.3 0.6 --epochs 120 --batch-size 128 --ntrials 1 --method sup --pooling mean --outdir ../out --rc
python train_scop.py --num-motifs 128 --method sup --outdir ../out
```
#### Generation of logos
You can add `--use-cuda` if you have CUDA installed. Specify `--tfid INDEX_OF_DATASET` to train a model on a given dataset. Add `--logo` when training and evaluating your model to generate sequence logos.
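For example, to train on the first dataset and write sequence logos (an illustrative combination of the flags above, assuming datasets are indexed from 0):
```
python train_encode.py --tfid 0 --num-motifs 128 --method sup --outdir ../out --logo
```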
There are two ways of generating logos
* Two-layer model
1. Add `--logo` when training and evaluating your model.
2. After training a model and saving it to `../out`, you can run
```
python report_encode.py --rc --num-layers 1 --num-motifs 16 --len-motifs 12 --subsamplings 1 --kernel-params 0.3 --epochs 120 --batch-size 128 --ntrials 1 --method sup --pooling mean --outdir ../out
cd experiments
python train_encode.py --num-layers 2 --num-motifs 64 16 --len-motifs 12 3 --subsamplings 2 1 --kernel-params 0.3 0.6 --method sup --outdir ../out
```
#### Unsupervised training
......@@ -73,37 +75,52 @@ There are two ways of generating logos
CKN-seq also supports unsupervised training; however, the number of filters needs to be much larger to achieve performance similar to supervised training. We give an example of training a one-layer model here
```
python train_encode.py --rc --num-layers 1 --num-motifs 512 --len-motifs 12 --subsamplings 1 --kernel-params 0.3 --batch-size 200 --ntrials 1 --method unsup --pooling mean --outdir ../out
python train_encode.py --num-motifs 4096 --batch-size 200 --method unsup --outdir ../out
```
#### Augmented training and hybrid training with perturbation
supervised CKN-seq can achieve better performance on _small-scale_ datasets by augmenting training samples with mismatch noises. An example for training with adding 20% mismatch noise to training samples
Supervised CKN-seq can achieve better performance on _small-scale_ datasets by augmenting training samples with mismatch noise. An example of training with 20% mismatch noise added to the training samples
```
python train_encode.py --rc --num-layers 1 --num-motifs 16 --len-motifs 12 --subsamplings 1 --kernel-params 0.3 --epochs 100 --batch-size 128 --ntrials 1 --method augmentation --pooling mean --outdir ../out --perturbation 0.2
python train_encode.py --epochs 150 --method sup --outdir ../out --noise 0.2
```
It can be further improved using the hybrid model described in the paper
It can be further improved using the hybrid model described in the paper, once you have trained an unsupervised model and saved it in `$PATH_TO_A_TRAINED_UNSUP_MODEL`.
```
python train_encode.py --rc --num-layers 1 --num-motifs 16 --len-motifs 12 --subsamplings 1 --kernel-params 0.3 --epochs 100 --batch-size 128 --ntrials 1 --method semisup --pooling mean --outdir ../out --perturbation 0.2 --unsup-dir $PATH_TO_A_TRAINED_UNSUP_MODEL
python train_encode.py --epochs 150 --method sup --outdir ../out --noise 0.2 --unsup-dir $PATH_TO_A_TRAINED_UNSUP_MODEL
```
## Reproduction of results for Zeng's motif occupancy datasets
## Further use of CKN-seq
Download data from the [website][5] and put it in data/encode_occupancy/. Then check
If you want to create a CKN-seq network, run
```
python train_encode_occupancy.py --help
from ckn.models import supCKN, unsupCKN

if method == "sup":
    M = supCKN
elif method == "unsup":
    M = unsupCKN

n_alphabet = 4
model = M(n_alphabet, [16], [12], [1], alpha=1e-6, reverse_complement=True,
          fit_bias=True, global_pool='mean')
# consult ckn/models.py for more options
```
## Reproduction of results for SCOP 1.67 datasets
Download data from [jlstm website][6] and put it to data/SCOP167-superfamily/.
Then you can train the network with a given data loader
```
python train_scop.py --help
# sup train
model.sup_train(data_loader, criterion, optimizer, lr_scheduler, epochs=100,
                val_loader=val_loader, use_cuda=True)

# unsup train with cross validation for regularization parameter
model.unsup_cross_val(train_loader,
                      n_sampling_patches=args.n_sampling_patches,
                      alpha_grid=search_grid,
                      use_cuda=True)
```
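The loader, criterion, optimizer and scheduler objects above are not defined by this snippet. A minimal sketch of how they could be built with the `EncodeLoader` class from this commit (the module name inside the `ckn` package, the loss choice and the optimizer settings are assumptions):
```
import torch
from torch.utils.data import DataLoader
from ckn.data_loader import EncodeLoader  # module name is an assumption

loader = EncodeLoader('data/encode')
tfid = loader.get_ids()[0]  # first ENCODE ChIP-seq dataset
train_dset, val_dset = loader.get_tensor(tfid, split='train')
data_loader = DataLoader(train_dset, batch_size=128, shuffle=True)
val_loader = DataLoader(val_dset, batch_size=128)

criterion = torch.nn.BCEWithLogitsLoss()  # loss choice is an assumption
optimizer = torch.optim.Adam(model.parameters())  # assumes the model built above is an nn.Module
lr_scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30)
```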
[1]: https://anaconda.org
......
from .ops import *
from .kernels import *
......@@ -5,12 +5,12 @@ from __future__ import division
import sys
from Bio import SeqIO
from Bio.Seq import Seq
import numpy as np
import Bio
import csv
import pandas as pd
from .dishuffle import doublet_shuffle
# from Bio.Seq import Seq
import numpy as np
# import Bio
# import csv
import pandas as pd
# from .dishuffle import doublet_shuffle
import torch
from torch.utils.data import Dataset
......@@ -20,99 +20,10 @@ if sys.version_info < (3,):
else:
maketrans = str.maketrans
DNA_TRANS = maketrans('NACGT', '\x00\x01\x02\x03\x04')
DNA_COMP_TRANS = maketrans('ACGT', '\x04\x03\x02\x01')
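# DNA_TRANS maps letters to small integer codes (N->0, A->1, C->2, G->3, T->4);
# DNA_COMP_TRANS maps each base to the code of its complement (A<->T, C<->G),
# as used by the commented-out revcomp helper below.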
# def load_fasta(filename, alphabet=None):
#     return SeqIO.parse(filename, 'fasta', alphabet=alphabet)
def load_fasta(filename, alphabet=None):
    return SeqIO.parse(filename, 'fasta', alphabet=alphabet)

# def load_labels(filename):
#     labels_handler = open(filename, 'rU')
#     labels = csv.DictReader(labels_handler, delimiter='\t')
#     strength = labels.fieldnames[-2]
#     return np.array([float(label[strength]) for label in labels], dtype='float32')

# def seq_order(s):
#     l = s.split(':')
#     chrome = l[0][3:]
#     start = int(l[1].split('-')[0])
#     return (chrome, start)
def preprocess(filename, outpath, labelname=None, sort=True, remove_missing=False):
    records = load_fasta(filename)
    labels = None
    if labelname is not None:
        labels_handler = open(labelname, 'rU')
        labels = csv.DictReader(labels_handler, delimiter='\t')
        fieldnames = labels.fieldnames
        ori_id = labels.fieldnames[-1]
    if sort:
        records = sorted(records, key=lambda x: seq_order(x.id))
        labels = sorted(labels, key=lambda x: seq_order(x[ori_id]))
    if remove_missing:
        records = [record.upper() for record in records if not 'n' in record and not 'N' in record]
        i = 0
        new_labels = []
        for label in labels:
            if i >= len(records):
                break
            if label[ori_id] == records[i].id:
                new_labels.append(label)
                i += 1
        missing_seq = len(labels) - len(new_labels)
        print('removed %d missing sequences' % missing_seq)
        labels = new_labels
    if labelname is not None:
        labels_handler.close()
    SeqIO.write(records, outpath+'cleaned_seq.fa', 'fasta')
    with open(outpath+'cleaned_label.beb', 'w+') as handler:
        writer = csv.DictWriter(handler, fieldnames=fieldnames, delimiter='\t')
        writer.writeheader()
        writer.writerows(labels)
def seq_to_index(records, alphabet):
    """
    transform biological sequences to list of indices
    """
    if alphabet == 'DNA':
        translator = DNA_TRANS
    else:
        raise ValueError('not implemented!')
    sequences = []
    for record in records:
        seq = map(int, str(record.seq).translate(translator))
        sequences.append(seq)
    return sequences

def seq2index(seq, alphabet='DNA'):
    """
    transform biological sequence {A,C,G,T} to {1,2,3,4}
    """
    if alphabet == 'DNA':
        translator = DNA_TRANS
    else:
        raise ValueError('not implemented!')
    seq = seq.translate(translator)
    seq = np.fromstring(seq, dtype='uint8')
    return seq.astype('int32')
# def revcomp(seq, alphabet='DNA'):
#     """
#     reverse complement of the given sequence
#     """
#     if alphabet == 'DNA':
#         translator = DNA_COMP_TRANS
#     else:
#         raise ValueError('not implemented!')
#     seq = seq.translate(translator)
#     return seq[::-1]
def pad_sequences(sequences, pre_padding=0, maxlen=None, dtype='int32',
                  padding='pre', truncating='pre', value=0.):
......@@ -146,7 +57,8 @@ def pad_sequences(sequences, pre_padding=0, maxlen=None, dtype='int32',
        elif truncating == 'post':
            trunc = s[:maxlen]
        else:
            raise ValueError('Truncating type "%s" not understood' % truncating)
            raise ValueError(
                'Truncating type "%s" not understood' % truncating)

        # check `trunc` has expected shape
        trunc = np.asarray(trunc, dtype=dtype)
......@@ -162,36 +74,11 @@ def pad_sequences(sequences, pre_padding=0, maxlen=None, dtype='int32',
        raise ValueError('Padding type "%s" not understood' % padding)
    return x
def load_data(filename, compression='gzip'):
    return pd.read_csv(filename, compression=compression, delimiter='\t')

def df_seq2index(df):
    """convert letters to numbers:
    A=1, C=2, G=3, T=4, N=0
    """
    df['seq_index'] = df['seq'].apply(seq2index)
    return df

def load_seq(filename, compression='gzip', mode='top'):  # , data_augment=False, reverse_complement=False
    df = load_data(filename, compression)
    if mode == 'top':
        df = df.iloc[:500]
    df = df_seq2index(df)
    return df

def augment_data(df):
    df2 = df.copy()
    df2['seq'] = df['seq'].apply(doublet_shuffle)
    df2['Bound'] = 0
    df2['EventID'] += '_background'
    df2.index += df.shape[0]
    df2 = df_seq2index(df2)
    return df2

def augment_noise(df, perturbation=0.1, quantity=10, max_index=4):
def augment(df, noise=0.1, quantity=10, max_index=4):
    def add_noise(seq):
        new_seq = seq.copy()
        mask = np.random.rand(len(seq)) < perturbation
        mask = np.random.rand(len(seq)) < noise
        new_seq[mask] = np.random.randint(1, max_index+1, size=np.count_nonzero(mask), dtype='int32')
        return new_seq
    df_list = [df]
......@@ -201,76 +88,39 @@ def augment_noise(df, perturbation=0.1, quantity=10, max_index=4):
        df_list.append(new_df)
    return pd.concat(df_list)
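# TensorDataset below wraps integer-encoded sequences and their labels. When
# `noise` > 0, __getitem__ returns a (possibly) perturbed copy of the sequence
# together with a flag indicating whether mismatch noise was applied: selected
# positions are resampled uniformly in {1, ..., max_index}.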
class TensorDataset(Dataset):
    def __init__(self, data_tensor, target_tensor, label_mask=None, count_tensor=None, max_index=4):
    def __init__(self, data_tensor, target_tensor, noise=0.0, max_index=4):
        assert data_tensor.size(0) == target_tensor.size(0)
        self.data_tensor = data_tensor
        self.target_tensor = target_tensor
        self.label_mask = label_mask
        self.count_tensor = count_tensor
        self.data_tensor = data_tensor
        self.target_tensor = target_tensor
        self.noise = noise
        self.max_index = max_index + 1

    def set_mask(self, mask):
        self.label_mask = mask

    def __getitem__(self, index):
        # if self.label_mask is not None:
        #     return {'data': self.data_tensor[index], 'label': self.target_tensor[index], 'mask': self.label_mask[index]}
        if self.count_tensor is not None:
            return {'data': self.data_tensor[index], 'label': self.target_tensor[index], 'count': self.count_tensor[index]}
        return {'data': self.data_tensor[index], 'label': self.target_tensor[index]}
        if self.noise == 0.:
            return self.data_tensor[index], self.target_tensor[index]
        data_tensor = self.data_tensor[index].clone()
        noise_mask = torch.ByteTensor([0])
        if np.random.rand() < 0.5:
            noise_mask = torch.ByteTensor([1])
            # mask = torch.Tensor(data_tensor.size(0)).uniform_() < self.noise
            mask = torch.rand_like(data_tensor, dtype=torch.float) < self.noise
            data_tensor[mask] = torch.LongTensor(
                mask.sum().item()).random_(1, self.max_index)
        return data_tensor, self.target_tensor[index], noise_mask

    def __len__(self):
        return self.data_tensor.size(0)

    def augment(self, perturbation=0.1, quantity=10):
    def augment(self, noise=0.1, quantity=10):
        if noise <= 0.:
            return
        new_tensor = [self.data_tensor]
        for i in range(quantity - 1):
            t = self.data_tensor.clone()
            mask = torch.Tensor(t.size()).uniform_() < perturbation
            t[mask] = torch.LongTensor(mask.sum()).random_(1, self.max_index)
            mask = torch.rand_like(t, dtype=torch.float) < noise
            t[mask] = torch.LongTensor(mask.sum().item()).random_(1, self.max_index)
            new_tensor.append(t)
        new_tensor = torch.cat(new_tensor)
        self.data_tensor = new_tensor
        self.data_tensor = torch.cat(new_tensor)
        self.target_tensor = self.target_tensor.repeat(quantity)
class NoisyDataset(Dataset):
    def __init__(self, data_tensor, target_tensor, amount=0.1, max_index=4):
        assert data_tensor.size(0) == target_tensor.size(0)
        self.data_tensor = data_tensor
        self.target_tensor = target_tensor
        self.amount = amount
        self.max_index = max_index + 1

    def __getitem__(self, index):
        data_tensor = self.data_tensor[index].clone()
        noise_mask = torch.ByteTensor([0])
        if np.random.rand() < 0.5:
            noise_mask = torch.ByteTensor([1])
            mask = torch.Tensor(data_tensor.size(0)).uniform_() < self.amount
            data_tensor[mask] = torch.LongTensor(mask.sum()).random_(1, self.max_index)
        return {'data': data_tensor, 'label': self.target_tensor[index], 'mask': noise_mask}

    def __len__(self):
        return self.data_tensor.size(0)
# def rolling_window(a, window):
#     shape = a.shape[:-1] + (a.shape[-1] - window + 1, window)
#     strides = a.strides + (a.strides[-1],)
#     return np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)

# def np_obtain_kmer_weights(seq, k, n_char=20):
#     """n_char=4 for DNA 20 for protein
#     return len=len(seq) - k + 1
#     """
#     kmers = rolling_window(seq, k)
#     bases = np.logspace(k - 1, 0, num=k, base=n_char)
#     kmers_idx = np.sum(kmers * bases, axis=1)
#     _, indices, counts = np.unique(kmers_idx, return_inverse=True, return_counts=True)
#     counts = 1./counts
#     return counts[indices]
if __name__ == "__main__":
import sys
import numpy as np
from collections import defaultdict
import numpy as np
from collections import defaultdict
import sys
if sys.version_info < (3,):
import string
......@@ -9,22 +9,25 @@ else:
_RNA = maketrans('U', 'T')
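# Helpers for dinucleotide ('doublet') shuffling: seq_to_graph records, for each
# letter, the list of letters that follow it in the sequence; graph_to_seq then
# walks such a graph to rebuild a sequence of the requested length, so shuffled
# sequences preserve the original dinucleotide composition.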
def seq_to_graph(seq):
    graph = defaultdict(list)
    for i in range(len(seq) - 1):
        graph[seq[i]].append(seq[i+1])
    return graph

def graph_to_seq(graph, first_vertex, len_seq):
    is_done = False
    new_seq = first_vertex
    while not is_done:
        last_vertex = new_seq[-1]
        last_vertex = new_seq[-1]
        new_seq += graph[last_vertex][0]
        graph[last_vertex].pop(0)
        if len(new_seq) >= len_seq:
            is_done = True
    return new_seq
    return new_seq

# function from: http://www.python.org/doc/essays/graphs/
def find_path(graph, start, end, path=[]):
......@@ -36,8 +39,10 @@ def find_path(graph, start, end, path=[]):
    for node in graph[start][-1]:
        if node not in path:
            newpath = find_path(graph, node, end, path)
            if newpath: return newpath
    return None
            if newpath:
                return newpath
    return None

def sample_new_graph(graph, last_vertex):
    new_graph = defaultdict(list)
......
import os
import sys
import numpy as np
import pandas as pd
import torch
from Bio import SeqIO
from collections import defaultdict
from sklearn.model_selection import train_test_split
from .data_helper import pad_sequences, TensorDataset, augment
from .dishuffle import doublet_shuffle
if sys.version_info < (3,):
    import string
    maketrans = string.maketrans
else:
    maketrans = str.maketrans
ALPHABETS = {
    'DNA': (
        'ACGT',
        '\x01\x02\x03\x04'
    ),
    'PROTEIN': (
        'ARNDCQEGHILKMFPSTWYV',
        '\x01\x02\x03\x04\x05\x06\x07\x08\t\n\x0b\x0c\r\x0e\x0f\x10\x11\x12\x13\x14'
    ),
}

AMBIGUOUS = {
    'DNA': ('N', '\x00'),
    'PROTEIN': ('XBZJUO', '\x00' * 6),
}
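# Encoding convention: ambiguous characters (N for DNA; X, B, Z, J, U, O for
# proteins) map to 0, and regular alphabet letters map to 1..len(alphabet).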
class SeqLoader(object):
    def __init__(self, alphabet='DNA'):
        self.alphabet, self.code = ALPHABETS[alphabet]
        alpha_ambi, code_ambi = AMBIGUOUS[alphabet]
        self.translator = maketrans(
            alpha_ambi + self.alphabet, code_ambi + self.code)
        self.alpha_nb = len(self.alphabet)

    def pad_seq(self, seq):
        seq = pad_sequences(seq, pre_padding=self.pre_padding,
                            maxlen=self.maxlen, padding='post',
                            truncating='post', dtype='int64')
        self.maxlen = seq.shape[1]
        return seq

    def seq2index(self, seq):
        seq = seq.translate(self.translator)
        seq = np.fromstring(seq, dtype='uint8')
        return seq.astype('int64')

    def get_ids(self, ids=None):
        pass

    def get_tensor(self, dataid, split='train', noise=0.0, fixed_noise=0.0,
                   val_split=0.25, top=False, generate_neg=True):
        df = self.load_data(dataid, split)
        if fixed_noise > 0.:
            df = augment(df, noise=fixed_noise, max_index=self.alpha_nb)
        if split == 'train' and generate_neg and hasattr(self, "aug_neg"):
            df = self.aug_neg(df)
        X, y = df['seq_index'], df['y'].values
        X = self.pad_seq(X)
        if top:
            X, _, y, _ = train_test_split(
                X, y, stratify=y, train_size=500)
        if split == 'train' and val_split > 0:
            X, X_val, y, y_val = train_test_split(
                X, y, test_size=val_split, stratify=y)
            X, y = torch.from_numpy(X), torch.from_numpy(y)
            X_val, y_val = torch.from_numpy(X_val), torch.from_numpy(y_val)
            train_dset = TensorDataset(
                X, y, noise=noise, max_index=self.alpha_nb)
            val_dset = TensorDataset(X_val, y_val, max_index=self.alpha_nb)
            return train_dset, val_dset
        X, y = torch.from_numpy(X), torch.from_numpy(y)
        return TensorDataset(X, y, noise=noise, max_index=self.alpha_nb)
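# EncodeLoader reads DeepBind-style ENCODE ChIP-seq files named <TF>_AC<ext>
# for training and <TF>_B<ext> for testing, and can generate negative examples
# by dinucleotide shuffling of the positives (aug_neg).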
class EncodeLoader(SeqLoader):
    def __init__(self, datadir, ext='.seq.gz', maxlen=None,
                 pre_padding=0):
        super(EncodeLoader, self).__init__()
        self.datadir = datadir
        self.ext = ext
        self.pre_padding = pre_padding
        self.maxlen = maxlen

    def get_ids(self, ids=None):
        targetnames = sorted(
            [filename.replace("_AC"+self.ext, "") for filename in os.listdir(
                self.datadir) if filename.endswith("_AC"+self.ext)])
        if ids is not None and ids != []:
            targetnames = [
                targetnames[tfid] for tfid in ids if tfid < len(targetnames)
            ]
        return targetnames

    def load_data(self, tfid, split='train'):
        if split == 'train':
            filename = "%s/%s_AC%s" % (self.datadir, tfid, self.ext)
        else:
            filename = "%s/%s_B%s" % (self.datadir, tfid, self.ext)
        df = pd.read_csv(filename, compression='gzip', delimiter='\t')
        df.rename(columns={'Bound': 'y', 'seq': 'Seq'}, inplace=True)
        df['seq_index'] = df['Seq'].apply(self.seq2index)
        return df

    def aug_neg(self, df):
        df2 = df.copy()
        df2['Seq'] = df['Seq'].apply(doublet_shuffle)
        df2['y'] = 0
        df2['EventID'] += '_neg'
        df2.index += df2.shape[0]
        df2['seq_index'] = df2['Seq'].apply(self.seq2index)
        df2 = pd.concat([df, df2])
        return df2
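# SCOPLoader reads paired FASTA files named '<pos|neg>-<split>.<dataset id>.fasta'
# and encodes protein sequences with the 20-letter alphabet defined above
# (negatives get label 0, positives label 1).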
class SCOPLoader(SeqLoader):
    def __init__(self, datadir='data/SCOP', ext='fasta', maxlen=400,
                 pre_padding=0):
        super(SCOPLoader, self).__init__('PROTEIN')
        self.datadir = datadir
        self.ext = ext
        self.pre_padding = pre_padding
        self.maxlen = maxlen
        self.filename_tp = datadir + '/{}-{}.{}.' + ext

    def get_ids(self, ids=None):
        names = sorted([".".join(filename.split('.')[1:-1])
                        for filename in os.listdir(self.datadir)
                        if filename.startswith('pos-train')])
        if ids is not None and ids != []:
            names = [names[index] for index in ids if index < len(names)]
        return names

    def get_nb(self, dataid, split='train'):
        pos = self.filename_tp.format('pos', split, dataid)
        neg = self.filename_tp.format('neg', split, dataid)
        return self.get_nb_fasta(pos) + self.get_nb_fasta(neg)

    def load_data(self, dataid, split='train'):
        pos = self.filename_tp.format('pos', split, dataid)
        neg = self.filename_tp.format('neg', split, dataid)
        table = defaultdict(list)
        for y, filename in enumerate([neg, pos]):
            for record in SeqIO.parse(filename, self.ext):
                seq = str(record.seq).upper()
                name = record.id
                table['Seq'].append(seq)
                table['seq_index'].append(self.seq2index(seq))
                table['name'].append(name)
                table['y'].append(y)
        df = pd.DataFrame(table)
        return df
if __name__ == '__main__':
    load = EncodeLoader('../../data/encode')
......@@ -4,26 +4,27 @@ from matplotlib.text import TextPath
from matplotlib.patches import PathPatch
from matplotlib.font_manager import FontProperties
import seaborn as sns
sns.set_style("white")
import numpy as np
sns.set_style("white")
import numpy as np
DNA_ALPHABET = ['A', 'C', 'G', 'T']
DEEPBIND_PL = {'G': 'orange',
'A': 'lime',
'C': 'blue',
'T': 'crimson'}
DEEPBIND_PL = {'G': 'orange',
'A': 'lime',
'C': 'blue',
'T': 'crimson'}
MEME_PL = {'G': 'orange',
'A': 'red',
'C': 'blue',
'T': 'darkgreen'}
MEME_PL = {'G': 'orange',
'A': 'red',
'C': 'blue',
'T': 'darkgreen'}
PALLETE = {'meme': MEME_PL, 'deepbind': DEEPBIND_PL}
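# trim_pwm returns the first and last motif positions whose information content
# is above 10% of the maximum (and at least 0.1 bits), so that low-information
# flanks can be trimmed from a logo.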
def trim_pwm(info):
def trim_pwm(info):
    max_ic = np.max(info)
    ic_threshold = np.max([0.1*max_ic, 0.1]);
    ic_threshold = np.max([0.1*max_ic, 0.1])
    w = info.shape[0]
    pfm_st = 0
    pfm_sp = w
......@@ -32,18 +33,20 @@ def trim_pwm(info):
            pfm_st += 1
        else:
            break

    for inf in reversed(info):
        if inf < ic_threshold:
            pfm_sp -= 1
    for inf in reversed(info):
        if inf < ic_threshold:
            pfm_sp -= 1
        else:
            break
    return pfm_st, pfm_sp
    return pfm_st, pfm_sp
def pwm_to_bits(pwm, trim=False):
    """pwm: m x 4 dataframe
    """
    pwm = np.asarray(pwm)
    info = np.asarray(np.log2(pwm.shape[1]) + np.sum(pwm * np.ma.log2(pwm), axis=1, keepdims=True))
    info = np.asarray(np.log2(
        pwm.shape[1]) + np.sum(pwm * np.ma.log2(pwm), axis=1, keepdims=True))
    print("IC: {}".format(np.sum(info)))
    bits = pwm * info
    if trim:
......@@ -51,14 +54,13 @@ def pwm_to_bits(pwm, trim=False):
        return bits[bits_st:bits_sp]
    return bits
def draw_logo(bits, width_scale=0.8, font_family="Arial", palette="meme"):
    # bits = pwm_to_bits(pwm)
    # print(np.sum(bits))
    try:
        alphabet_inverse = pwm.columns[::-1]
        alphabet_inverse = bits.columns[::-1]
    except:
        alphabet_inverse = DNA_ALPHABET[::-1]
    fp = FontProperties(family=font_family, weight="bold")
    fp = FontProperties(family=font_family, weight="bold")
    COLOR_SCHEME = PALLETE[palette]
......@@ -72,9 +74,12 @@ def draw_logo(bits, width_scale=0.8, font_family="Arial", palette="meme"):
        w_scale = 1./w
        h_scale = yscale * 0.95 / h
        transform = mpl.transforms.Affine2D().scale(w_scale, h_scale) + mpl.transforms.Affine2D().translate(x - 0.5 - w_scale*anchor, y) + ax.transData
        txt = PathPatch(text, lw=0, fc=COLOR_SCHEME[letter], transform=transform)
        if ax != None:
        transform = mpl.transforms.Affine2D().scale(w_scale, h_scale) + \
            mpl.transforms.Affine2D().translate(x - 0.5 - w_scale*anchor, y) + \
            ax.transData
        txt = PathPatch(
            text, lw=0, fc=COLOR_SCHEME[letter], transform=transform)
        if ax is not None: