…
Generally speaking, the Python API for khmer and oxli assumes that it is receiving valid DNA (ACGT). Low-level hash functions like hash(kmer) and mid-level hash functions like hash_kmer_hashes(str) do not check for valid DNA, and their output for invalid DNA characters is unspecified.
However, bulk loading functions provide a cleaned_seq attribute that will … document me here.
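Since the low-level functions do not validate their input, a caller may want to check sequences before hashing them. The following is a minimal pure-Python sketch; is_valid_dna is a hypothetical helper, not part of khmer, and it assumes that only uppercase A, C, G and T count as valid.

```python
VALID_BASES = frozenset("ACGT")


def is_valid_dna(seq):
    """Return True if seq is non-empty and contains only A, C, G, T.

    Hypothetical helper for illustration; not a khmer function.
    """
    return bool(seq) and set(seq) <= VALID_BASES
```

A caller could skip or report any sequence for which this returns False rather than passing it to a hash function whose behavior on such input is unspecified.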
Type names consist of two parts. The first part indicates how far the type can count and the second part whether it is a table or a graph.
Possible choices for the first part:
* Node, uses a 1-bit counter
* SmallCount, uses a 4-bit counter
* Count, uses an 8-bit counter
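The counter widths determine the largest count each prefix can hold. As a small worked example (plain arithmetic, not khmer code; the names COUNTER_BITS and max_count are made up for illustration):

```python
# Counter width in bits for each type-name prefix, as listed above.
COUNTER_BITS = {"Node": 1, "SmallCount": 4, "Count": 8}


def max_count(bits):
    """Largest value an unsigned counter with the given number of bits can hold."""
    return (1 << bits) - 1


limits = {name: max_count(bits) for name, bits in COUNTER_BITS.items()}
# limits == {"Node": 1, "SmallCount": 15, "Count": 255}
```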
Possible choices for the second part:
* Table, keeps track of k-mers
* Graph, navigates, tags, etc. the de Bruijn graph formed by the k-mers
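To make "the de Bruijn graph formed by the k-mers" concrete: two k-mers are adjacent when they overlap by k-1 bases. The sketch below is a simplified illustration, not khmer's implementation; toy_neighbors and known_kmers are hypothetical names, and reverse complements are ignored here.

```python
def toy_neighbors(kmer, known_kmers):
    """Return the k-mers in known_kmers that overlap kmer by k-1 bases.

    Simplified illustration: only the forward strand is considered.
    """
    suffix, prefix = kmer[1:], kmer[:-1]
    found = []
    for base in "ACGT":
        # suffix + base extends to the right, base + prefix extends to the left.
        for candidate in (suffix + base, base + prefix):
            if candidate in known_kmers and candidate not in found:
                found.append(candidate)
    return found
```

With known_kmers = {"ACG", "CGT", "TAC"}, the k-mer "ACG" has the right neighbor "CGT" and the left neighbor "TAC".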
C++ class name:
Python methods:
k = ksize() - return the k-mer size (a positive integer).
hashval = hash(dna_kmer) - return the result of hashing dna_kmer, which will be a non-negative integer. len(dna_kmer) must be exactly the k-mer size. Which hash function is used depends on the table type (@document).
dna_kmer = reverse_hash(hashval) - return a DNA string that will hash to hashval. If there are multiple such strings, return only one. May be unimplemented for particular table types, in which case a ValueError will be raised.
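The relationship between hash and reverse_hash can be illustrated with one possible reversible scheme: a 2-bit-per-base encoding where a k-mer and its reverse complement share the smaller of their two values. This is a sketch under that assumption only; it is not claimed to be the hash function used by any particular khmer table type, and the toy_* names are hypothetical.

```python
_BITS = {"A": 0, "C": 1, "G": 2, "T": 3}
_BASES = "ACGT"
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def _revcomp(kmer):
    """Reverse complement of a DNA string."""
    return kmer.translate(_COMPLEMENT)[::-1]


def _encode(kmer):
    """Pack a k-mer into an integer, two bits per base."""
    value = 0
    for base in kmer:
        value = (value << 2) | _BITS[base]
    return value


def toy_hash(kmer, k):
    """Return a non-negative integer shared by kmer and its reverse complement."""
    if len(kmer) != k:
        raise ValueError("k-mer must be exactly k bases long")
    return min(_encode(kmer), _encode(_revcomp(kmer)))


def toy_reverse_hash(hashval, k):
    """Return one DNA string of length k that hashes to hashval."""
    bases = []
    for _ in range(k):
        bases.append(_BASES[hashval & 3])
        hashval >>= 2
    return "".join(reversed(bases))
```

In this scheme "TTTT" and "AAAA" both hash to 0, and toy_reverse_hash(0, 4) returns only "AAAA", which matches the "return only one" behavior described above.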
sizelist = hashsizes() - return the list of table sizes used in construction.
n = n_unique_kmers() - retrieve an estimate for the number of unique k-mers inserted into the table. Note, this may be order dependent.
n = n_occupied() - retrieve the number of bins occupied in the table.
add(dna_kmer_or_hashval) - increment the count associated with either a DNA k-mer or a hashval. Depending on the max count for the table type and the bigcount setting, the count may top out at 1, 15, 255, or 65535. (@CTB add method for retrieving max_count)
count (synonym for add)
get(dna_kmer_or_hashval) - retrieve the count associated with a DNA k-mer or a hashval.
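The "top out" behavior of add can be pictured as a saturating counter. The sketch below uses an exact Python dict rather than khmer's tables, so it only illustrates the capping logic; ToyCounter and its max_count parameter are hypothetical.

```python
class ToyCounter:
    """Exact, dict-backed counter that saturates at max_count (illustration only)."""

    def __init__(self, max_count=255):
        self.max_count = max_count
        self._counts = {}

    def add(self, kmer):
        """Increment the count for kmer unless it has reached max_count."""
        current = self._counts.get(kmer, 0)
        if current < self.max_count:
            self._counts[kmer] = current + 1

    # count is a synonym for add, mirroring the method list above.
    count = add

    def get(self, kmer):
        """Return the stored count, or 0 for a k-mer never added."""
        return self._counts.get(kmer, 0)
```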
list_of_strings = get_kmers(seq) - return the list of k-mer strings in the given sequence.
list_of_hashes = get_kmer_hashes(seq) - return the list of the hashed k-mers in the given sequence.
hashset = get_kmer_hashes_as_hashset(seq) - return the hashset of hashed k-mers in the given sequence.
list_of_counts = get_kmer_counts(seq) - return the list of the counts of the k-mers in the given sequence.
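The get_kmers family walks a sliding window of width k across the sequence. A minimal pure-Python sketch of that idea follows; kmers_in and kmer_counts_in are hypothetical stand-ins that take k and a count mapping explicitly, whereas the khmer methods take them from the table object.

```python
from collections import Counter


def kmers_in(seq, k):
    """Return the k-mer strings of seq in order of position."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def kmer_counts_in(seq, k, counts):
    """Return the count of each k-mer of seq, in order of position."""
    return [counts[kmer] for kmer in kmers_in(seq, k)]


# Count the k-mers of one sequence, then look up counts for another.
counts = Counter(kmers_in("ATGGATGG", 4))
example = kmer_counts_in("ATGGA", 4, counts)  # [2, 1]
```

A sequence of length n yields n - k + 1 k-mers, and a sequence shorter than k yields none.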
save(filename) - save the data to a file on disk.
load(filename) - load the data from a file on disk.
min_count = get_min_count(seq) - return the minimum count for k-mers in the sequence.
med_count, avg_count, stddev_count = get_median_count(seq) - return the median, average, and stddev of the counts for k-mers in the sequence.
max_count = get_max_count(seq) - return the maximum count for k-mers in the sequence.
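These three methods summarize the per-k-mer counts of a sequence. Given such a list of counts, the statistics could be computed as sketched below. This uses Python's statistics module and assumes a population standard deviation; how khmer itself picks the median for even-length lists or which stddev variant it uses is not stated here. count_summary is a hypothetical helper.

```python
import statistics


def count_summary(kmer_counts):
    """Return (min, median, average, stddev, max) for a non-empty list of counts."""
    return (
        min(kmer_counts),
        statistics.median(kmer_counts),
        statistics.mean(kmer_counts),
        statistics.pstdev(kmer_counts),  # population stddev: an assumption
        max(kmer_counts),
    )
```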
num_kmers = consume(seq) - count all the k-mers in the given DNA string. @CTB should be consume string
consume_fasta(filename) - count all the k-mers in a (DNA) FASTA/FASTQ file.
consume_fasta_with_reads_parser(khmer.ReadParser object) - count all the k-mers in (DNA) sequences returned from a ReadParser object. This can be used to ensure various forms of pairing are present, etc.
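The consume functions all reduce to "for each sequence, count every k-mer". A self-contained sketch of that loop for FASTA input is shown below. read_fasta and toy_consume are hypothetical helpers, not khmer's parser; FASTQ and ReadParser pairing options are not handled.

```python
import io
from collections import Counter


def read_fasta(handle):
    """Yield (name, sequence) pairs from a FASTA-formatted text handle."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def toy_consume(seq, k, counts):
    """Count every k-mer of seq into counts; return how many were counted."""
    n = max(len(seq) - k + 1, 0)
    for i in range(n):
        counts[seq[i:i + k]] += 1
    return n


fasta = io.StringIO(">r1\nACGT\nAC\n>r2\nGGG\n")
counts = Counter()
total = sum(toy_consume(seq, 3, counts) for _, seq in read_fasta(fasta))
```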
(trim_seq, trim_pos) = trim_on_abundance(seq, abund) - trim the sequence at the first k-mer with an abundance strictly below (<) the provided abundance.
(trim_seq, trim_pos) = trim_below_abundance(seq, abund) - trim the sequence at the first k-mer with an abundance strictly above (>) the provided abundance.
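The trimming logic can be sketched as a scan for the first k-mer that fails the abundance test. The version below covers the trim_on_abundance case (first k-mer with count < abund). The exact cut position is an assumption of this sketch: it drops the last base of the first failing k-mer and everything after it, and returns an empty sequence when the very first k-mer fails. toy_trim_on_abundance and its explicit k and counts arguments are hypothetical.

```python
def toy_trim_on_abundance(seq, k, counts, abund):
    """Trim seq at the first k-mer whose count is strictly below abund.

    Returns (trimmed_sequence, trim_position). The position convention is
    an assumption made for this illustration.
    """
    for i in range(len(seq) - k + 1):
        if counts.get(seq[i:i + k], 0) < abund:
            pos = 0 if i == 0 else i + k - 1
            return seq[:pos], pos
    # No k-mer failed the test: keep the whole sequence.
    return seq, len(seq)
```

For example, with counts = {"ACG": 3, "CGT": 3, "GTA": 1, "TAC": 3}, k = 3 and abund = 2, the sequence "ACGTAC" is cut at the k-mer "GTA", giving ("ACGT", 4).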
list_of_posns = find_spectral_error_positions(seq, abund) - find potential locations of errors in input sequence, where k-mers are below given abundance.
set_use_bigcount(bool) - some table types (Counttable and Countgraph) support counting past their max value, using a (memory-intensive) C++ std::map. This method turns that “bigcount” behavior on or off. It raises a ValueError when called on a table type that does not support it.
get_use_bigcount() - return the bigcount setting (False by default).
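The bigcount idea is an overflow map: counts live in a small fixed-width counter until it saturates, and only then spill into a separate, more expensive mapping. The sketch below models that with two Python dicts. ToyBigCounter is hypothetical, and the limits 255 and 65535 are taken from the caps listed for add; the real storage layout is not described here.

```python
class ToyBigCounter:
    """8-bit style counter with an optional overflow map (illustration only)."""

    SMALL_MAX = 255
    BIG_MAX = 65535

    def __init__(self, use_bigcount=False):
        self.use_bigcount = use_bigcount
        self._small = {}  # stands in for the fixed-width table
        self._big = {}    # stands in for the overflow map

    def add(self, kmer):
        small = self._small.get(kmer, 0)
        if small < self.SMALL_MAX:
            self._small[kmer] = small + 1
        elif self.use_bigcount:
            # Small counter is saturated: continue counting in the overflow map.
            big = self._big.get(kmer, self.SMALL_MAX)
            if big < self.BIG_MAX:
                self._big[kmer] = big + 1

    def get(self, kmer):
        if kmer in self._big:
            return self._big[kmer]
        return self._small.get(kmer, 0)
```

Only k-mers that exceed the small limit cost extra memory, which is why the overflow map is optional.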
dist = abundance_distribution(filename, tracking_obj) - generate an abundance distribution for the k-mers in the given file, using the tracking_obj to avoid double-counting identical k-mers.
dist = abundance_distribution_with_reads_parser(readparser_obj, tracking_obj) - generate an abundance distribution for the k-mers loaded from the given readparser, using the tracking_obj to avoid double-counting identical k-mers.
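The role of the tracking object is easiest to see in a small sketch: each distinct k-mer contributes to the distribution once, no matter how many times it appears in the input. The version below uses a Python set as the tracker and returns a dict mapping abundance to number of distinct k-mers. toy_abundance_distribution and its arguments are hypothetical, and the return type used by khmer may differ.

```python
from collections import Counter


def toy_abundance_distribution(sequences, k, counts):
    """Return {abundance: number of distinct k-mers with that abundance}."""
    seen = set()  # plays the role of tracking_obj
    dist = {}
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if kmer in seen:
                continue  # already tallied; avoid double-counting
            seen.add(kmer)
            abundance = counts.get(kmer, 0)
            dist[abundance] = dist.get(abundance, 0) + 1
    return dist


reads = ["ACGTA", "ACGA"]
counts = Counter(r[i:i + 3] for r in reads for i in range(len(r) - 2))
dist = toy_abundance_distribution(reads, 3, counts)
```

Here "ACG" occurs in both reads, so it is counted twice in counts but tallied only once in dist.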
All the methods of table types, and in addition:
neighbors
calc_connected_graph_size
kmer_degree
count_kmers_within_radius
find_high_degree_nodes
traverse_linear_path
assemble_linear_path
consume_and_tag
get_tags_and_positions
find_all_tags_list
consume_fasta_and_tag
extract_unique_paths
print_tagset
add_tag
get_tagset
load_tagset
save_tagset
n_tags
divide_tags_into_subsets
_get_tag_density
_set_tag_density
do_subset_partition
find_all_tags
assign_partition_id
output_partitions
load_partitionmap
save_partitionmap
_validate_partitionmap
consume_fasta_and_tag_with_reads_parser
consume_partitioned_fasta
merge_subset
merge_subset_from_disk
count_partitions
subset_count_partitions
subset_partition_size_distribution
save_subset_partitionmap
load_subset_partitionmap
_validate_subset_partitionmap
set_partition_id
join_partitions
get_partition_id
repartition_latest_partition
load_stop_tags
save_stop_tags
print_stop_tags
trim_on_stoptags
add_stop_tags
get_stop_tags
Smallcountgraph:
get_raw_tables
Countgraph:
get_raw_tables
do_subset_partition_with_abundance
Nodegraph:
update
get_raw_tables