Title: | Sequence Clustering with Discrete-Output HMMs |
---|---|
Description: | Provides an implementation of a mixture of hidden Markov models (HMMs) for discrete sequence data in the Discrete Bayesian HMM Clustering (DBHC) algorithm. The DBHC algorithm is an HMM Clustering algorithm that finds a mixture of discrete-output HMMs while using heuristics based on Bayesian Information Criterion (BIC) to search for the optimal number of HMM states and the optimal number of clusters. |
Authors: | Gabriel Budel [aut, cre], Flavius Frasincar [aut] |
Maintainer: | Gabriel Budel <[email protected]> |
License: | GPL (>= 3) |
Version: | 0.0.3 |
Built: | 2025-03-08 04:01:24 UTC |
Source: | https://github.com/gabybudel/dbhc |
Assign sequences to cluster models that give the highest sequence-to-hmm
likelihood. Used in hmm.clust
.
assign.clusters(partition, memberships, sequences, smoothing = 1e-04)
assign.clusters(partition, memberships, sequences, smoothing = 1e-04)
partition |
A list object with the partition, a mixture of HMMs. Each
element in the list is an |
memberships |
A matrix with cluster memberships for each sequence. |
sequences |
An |
smoothing |
Smoothing parameter for absolute discounting in
|
The updated matrix with cluster memberships for each sequence.
Used in main function for the DBHC algorithm
hmm.clust
.
Compute the BIC of a single HMM given a threshold epsilon for counting
parameters. Auxiliary function used in size.search
.
cluster.bic(hmm, eps = 0.001)
cluster.bic(hmm, eps = 0.001)
hmm |
An |
eps |
A threshold epsilon for counting parameters. |
The BIC of hmm
.
Used in size.search
.
Count the number of parameters in an HMM larger than a small number epsilon.
Auxiliary function used in partition.bic
and
cluster.bic
.
count.parameters(hmm, eps = 0.001)
count.parameters(hmm, eps = 0.001)
hmm |
An |
eps |
A threshold epsilon for counting parameters. |
The number of parameters larger than eps
.
Used in partition.bic
and cluster.bic
.
Plots a heatmap of an HMM's emission probabilities.
emission.heatmap(emission, base_size = 10)
emission.heatmap(emission, base_size = 10)
emission |
A matrix with emission probabilities (see also
|
base_size |
Numerical, a size parameter for the plots made using |
See hmm.clust
for an example.
Implementation of the DBHC algorithm, an HMM clustering algorithm that finds a mixture of discrete-output HMMs. The algorithm uses heuristics based on BIC to search for the optimal number of hidden states in each HMM and the optimal number of clusters.
hmm.clust( sequences, id = NULL, smoothing = 1e-04, eps = 0.001, init.size = 2, alphabet = NULL, K.max = NULL, log_space = FALSE, print = FALSE, seed.size = 3 )
hmm.clust( sequences, id = NULL, smoothing = 1e-04, eps = 0.001, init.size = 2, alphabet = NULL, K.max = NULL, log_space = FALSE, print = FALSE, seed.size = 3 )
sequences |
An |
id |
A vector with ids that identify the sequences in |
smoothing |
Smoothing parameter for absolute discounting in
|
eps |
A threshold epsilon for counting parameters in
|
init.size |
The number of HMM states in an initial HMM. |
alphabet |
The alphabet of output labels, if not provided alphabet is
taken from |
K.max |
Maximum number of clusters, if not provided algorithm searches for the optimal number itself. |
log_space |
Logical, parameter provided to
|
print |
Logical, whether to print intermediate steps or not. |
seed.size |
Seed size, the number of sequences to be selected for a seed |
A list with components:
sequences
An
stslist
object of sequences with discrete observations.
id
A vector with ids that identify the sequences in
sequences
.
cluster
A vector with found cluster memberships for the sequences.
partition
A list object with
the partition, a mixture of HMMs. Each element in the list is an hmm
object.
memberships
A matrix with cluster memberships for each sequence.
n.clusters
Numerical, the found number of clusters.
sizes
A vector with the number of HMM states for each cluster model.
bic
A vector with the BICs for each cluster model.
## Simulated data library(seqHMM) output.labels <- c("H", "T") # HMM 1 states.1 <- c("A", "B", "C") transitions.1 <- matrix(c(0.8,0.1,0.1,0.1,0.8,0.1,0.1,0.1,0.8), nrow = 3) rownames(transitions.1) <- states.1 colnames(transitions.1) <- states.1 emissions.1 <- matrix(c(0.5,0.75,0.25,0.5,0.25,0.75), nrow = 3) rownames(emissions.1) <- states.1 colnames(emissions.1) <- output.labels initials.1 <- c(1/3,1/3,1/3) # HMM 2 states.2 <- c("A", "B") transitions.2 <- matrix(c(0.75,0.25,0.25,0.75), nrow = 2) rownames(transitions.2) <- states.2 colnames(transitions.2) <- states.2 emissions.2 <- matrix(c(0.8,0.6,0.2,0.4), nrow = 2) rownames(emissions.2) <- states.2 colnames(emissions.2) <- output.labels initials.2 <- c(0.5,0.5) # Simulate hmm.sim.1 <- simulate_hmm(n_sequences = 100, initial_probs = initials.1, transition_probs = transitions.1, emission_probs = emissions.1, sequence_length = 25) hmm.sim.2 <- simulate_hmm(n_sequences = 100, initial_probs = initials.2, transition_probs = transitions.2, emission_probs = emissions.2, sequence_length = 25) sequences <- rbind(hmm.sim.1$observations, hmm.sim.2$observations) n <- nrow(sequences) # Clustering algorithm id <- paste0("K-", 1:n) rownames(sequences) <- id sequences <- sequences[sample(1:n, n),] res <- hmm.clust(sequences, id = rownames(sequences)) ############################################################################# ## Swiss Household Data data("biofam", package = "TraMineR") # Clustering algorithm new.alphabet <- c("P", "L", "M", "LM", "C", "LC", "LMC", "D") sequences <- seqdef(biofam[,10:25], alphabet = 0:7, states = new.alphabet) ## Not run: res <- hmm.clust(sequences) # Heatmaps cluster <- 1 # display heatmaps for cluster 1 transition.heatmap(res$partition[[cluster]]$transition_probs, res$partition[[cluster]]$initial_probs) emission.heatmap(res$partition[[cluster]]$emission_probs) ## End(Not run) ## A smaller example, which takes less time to run subset <- sequences[sample(1:nrow(sequences), 20, replace = FALSE),] # Clustering algorithm, limiting number of clusters to 2 res <- hmm.clust(subset, K.max = 2) # Number of clusters print(res$n.clusters) # Table of cluster memberships table(res$memberships[,"cluster"]) # BIC for each number of clusters print(res$bic) # Heatmaps cluster <- 1 # display heatmaps for cluster 1 transition.heatmap(res$partition[[cluster]]$transition_probs, res$partition[[cluster]]$initial_probs) emission.heatmap(res$partition[[cluster]]$emission_probs)
## Simulated data library(seqHMM) output.labels <- c("H", "T") # HMM 1 states.1 <- c("A", "B", "C") transitions.1 <- matrix(c(0.8,0.1,0.1,0.1,0.8,0.1,0.1,0.1,0.8), nrow = 3) rownames(transitions.1) <- states.1 colnames(transitions.1) <- states.1 emissions.1 <- matrix(c(0.5,0.75,0.25,0.5,0.25,0.75), nrow = 3) rownames(emissions.1) <- states.1 colnames(emissions.1) <- output.labels initials.1 <- c(1/3,1/3,1/3) # HMM 2 states.2 <- c("A", "B") transitions.2 <- matrix(c(0.75,0.25,0.25,0.75), nrow = 2) rownames(transitions.2) <- states.2 colnames(transitions.2) <- states.2 emissions.2 <- matrix(c(0.8,0.6,0.2,0.4), nrow = 2) rownames(emissions.2) <- states.2 colnames(emissions.2) <- output.labels initials.2 <- c(0.5,0.5) # Simulate hmm.sim.1 <- simulate_hmm(n_sequences = 100, initial_probs = initials.1, transition_probs = transitions.1, emission_probs = emissions.1, sequence_length = 25) hmm.sim.2 <- simulate_hmm(n_sequences = 100, initial_probs = initials.2, transition_probs = transitions.2, emission_probs = emissions.2, sequence_length = 25) sequences <- rbind(hmm.sim.1$observations, hmm.sim.2$observations) n <- nrow(sequences) # Clustering algorithm id <- paste0("K-", 1:n) rownames(sequences) <- id sequences <- sequences[sample(1:n, n),] res <- hmm.clust(sequences, id = rownames(sequences)) ############################################################################# ## Swiss Household Data data("biofam", package = "TraMineR") # Clustering algorithm new.alphabet <- c("P", "L", "M", "LM", "C", "LC", "LMC", "D") sequences <- seqdef(biofam[,10:25], alphabet = 0:7, states = new.alphabet) ## Not run: res <- hmm.clust(sequences) # Heatmaps cluster <- 1 # display heatmaps for cluster 1 transition.heatmap(res$partition[[cluster]]$transition_probs, res$partition[[cluster]]$initial_probs) emission.heatmap(res$partition[[cluster]]$emission_probs) ## End(Not run) ## A smaller example, which takes less time to run subset <- sequences[sample(1:nrow(sequences), 20, replace = FALSE),] # Clustering algorithm, limiting number of clusters to 2 res <- hmm.clust(subset, K.max = 2) # Number of clusters print(res$n.clusters) # Table of cluster memberships table(res$memberships[,"cluster"]) # BIC for each number of clusters print(res$bic) # Heatmaps cluster <- 1 # display heatmaps for cluster 1 transition.heatmap(res$partition[[cluster]]$transition_probs, res$partition[[cluster]]$initial_probs) emission.heatmap(res$partition[[cluster]]$emission_probs)
Get the log likelihood of an HMM object and check if it is feasible (i.e.,
contains no illegal emissions). Auxiliary function used in
partition.bic
.
model.ll(hmm)
model.ll(hmm)
hmm |
An |
The log likelihood of the hmm
object, print warning if model
is infeasible (i.e., if the log likelihood is evaluated for a sequence that
contains emissions that are assigned probability 0 in the hmm
object).
Used in partition.bic
.
Compute the BIC of a partition given a threshold epsilon for counting
parameters. Auxiliary function used in hmm.clust
.
partition.bic(partition, eps = 0.001)
partition.bic(partition, eps = 0.001)
partition |
A list object with the partition of HMMs, a mixture of HMMs. |
eps |
A threshold epsilon for counting parameters in
|
The BIC of the partition.
Used in main function for the DBHC algorithm
hmm.clust
.
Seed selection procedure of the DBHC algorithm, also invokes size search
algorithm for seed in size.search. Used in hmm.clust
.
select.seeds( sequences, log_space = FALSE, K, seed.size = 3, init.size = 2, print = FALSE, smoothing = 1e-04 )
select.seeds( sequences, log_space = FALSE, K, seed.size = 3, init.size = 2, print = FALSE, smoothing = 1e-04 )
sequences |
An |
log_space |
Logical, parameter provided to
|
K |
The number of seeds to select, equal to the number of clusters in a partition. |
seed.size |
Seed size, the number of sequences to be selected for a seed. |
init.size |
The number of HMM states in an initial HMM. |
print |
Logical, whether to print intermediate steps or not. |
smoothing |
Smoothing parameter for absolute discounting in
|
A partition as a list object with HMMs for the selected seeds.
Used in main function for the DBHC algorithm
hmm.clust
.
Compute the sequence-to-HMM likelihood of an HMM evaluated for a single
sequence and check if the sequence contains emissions that are not possible
according to the HMM. Auxiliary function used in select.seeds
and assign.clusters
.
seq2hmm.ll(hmm)
seq2hmm.ll(hmm)
hmm |
An |
The log likelihood of the sequence contained in hmm
, value
will be set to minus infinity if the sequence contains illegal emissions.
Used in select.seeds
and
assign.clusters
.
The size search algorithm finds the optimal number of HMM states for a set of
sequences and returns both the optimal hmm
object and the
corresponding number of hidden states. Used in select.seeds
.
size.search(sequences, log_space = FALSE, print = FALSE)
size.search(sequences, log_space = FALSE, print = FALSE)
sequences |
An |
log_space |
Logical, parameter provided to
|
print |
Logical, whether to print intermediate steps or not. |
A list with the optimal number of HMM states and the optimal
hmm
object.
Used in the DBHC seed selection procedure in
select.seeds
.
Smooth the parameters of an HMM using absolute discounting given a threshold
epsilon. Auxiliary function used in select.seeds
,
assign.clusters
, and hmm.clust
.
smooth.hmm(hmm, smoothing = 1e-04)
smooth.hmm(hmm, smoothing = 1e-04)
hmm |
A raw |
smoothing |
Smoothing parameter for absolute discounting in
|
An hmm
object with smoothed probabilities.
Used in select.seeds
, assign.clusters
,
and main function for the DBHC algorithm hmm.clust
.
Smooth a vector of probabilities using absolute discounting. Auxiliary
function used in smooth.hmm
.
smooth.probabilities(probs, smoothing = 1e-04)
smooth.probabilities(probs, smoothing = 1e-04)
probs |
A vector of raw probabilities. |
smoothing |
Smoothing parameter for absolute discounting. |
A vector of smoothed probabilities.
Used in smooth.hmm
.
Plots a heatmap of an HMM's initial and transition probabilities.
transition.heatmap(transition, initial = NULL, base_size = 10)
transition.heatmap(transition, initial = NULL, base_size = 10)
transition |
A matrix with transition probabilities (see also
|
initial |
An (optional) vector of initial probabilities. |
base_size |
Numerical, a size parameter for the plots made using |
See hmm.clust
for an example.