| Title: | Sequence Clustering with Discrete-Output HMMs |
|---|---|
| Description: | Provides an implementation of a mixture of hidden Markov models (HMMs) for discrete sequence data in the Discrete Bayesian HMM Clustering (DBHC) algorithm. The DBHC algorithm is an HMM Clustering algorithm that finds a mixture of discrete-output HMMs while using heuristics based on Bayesian Information Criterion (BIC) to search for the optimal number of HMM states and the optimal number of clusters. |
| Authors: | Gabriel Budel [aut, cre], Flavius Frasincar [aut] |
| Maintainer: | Gabriel Budel <[email protected]> |
| License: | GPL (>= 3) |
| Version: | 0.0.3 |
| Built: | 2026-05-10 08:31:12 UTC |
| Source: | https://github.com/gabybudel/dbhc |
Assign sequences to cluster models that give the highest sequence-to-hmm
likelihood. Used in hmm.clust.
assign.clusters(partition, memberships, sequences, smoothing = 1e-04)assign.clusters(partition, memberships, sequences, smoothing = 1e-04)
partition |
A list object with the partition, a mixture of HMMs. Each
element in the list is an |
memberships |
A matrix with cluster memberships for each sequence. |
sequences |
An |
smoothing |
Smoothing parameter for absolute discounting in
|
The updated matrix with cluster memberships for each sequence.
Used in main function for the DBHC algorithm
hmm.clust.
Compute the BIC of a single HMM given a threshold epsilon for counting
parameters. Auxiliary function used in size.search.
cluster.bic(hmm, eps = 0.001)cluster.bic(hmm, eps = 0.001)
hmm |
An |
eps |
A threshold epsilon for counting parameters. |
The BIC of hmm.
Used in size.search.
Count the number of parameters in an HMM larger than a small number epsilon.
Auxiliary function used in partition.bic and
cluster.bic.
count.parameters(hmm, eps = 0.001)count.parameters(hmm, eps = 0.001)
hmm |
An |
eps |
A threshold epsilon for counting parameters. |
The number of parameters larger than eps.
Used in partition.bic and cluster.bic.
Plots a heatmap of an HMM's emission probabilities.
emission.heatmap(emission, base_size = 10)emission.heatmap(emission, base_size = 10)
emission |
A matrix with emission probabilities (see also
|
base_size |
Numerical, a size parameter for the plots made using |
See hmm.clust for an example.
Implementation of the DBHC algorithm, an HMM clustering algorithm that finds a mixture of discrete-output HMMs. The algorithm uses heuristics based on BIC to search for the optimal number of hidden states in each HMM and the optimal number of clusters.
hmm.clust( sequences, id = NULL, smoothing = 1e-04, eps = 0.001, init.size = 2, alphabet = NULL, K.max = NULL, log_space = FALSE, print = FALSE, seed.size = 3 )hmm.clust( sequences, id = NULL, smoothing = 1e-04, eps = 0.001, init.size = 2, alphabet = NULL, K.max = NULL, log_space = FALSE, print = FALSE, seed.size = 3 )
sequences |
An |
id |
A vector with ids that identify the sequences in |
smoothing |
Smoothing parameter for absolute discounting in
|
eps |
A threshold epsilon for counting parameters in
|
init.size |
The number of HMM states in an initial HMM. |
alphabet |
The alphabet of output labels, if not provided alphabet is
taken from |
K.max |
Maximum number of clusters, if not provided algorithm searches for the optimal number itself. |
log_space |
Logical, parameter provided to
|
print |
Logical, whether to print intermediate steps or not. |
seed.size |
Seed size, the number of sequences to be selected for a seed |
A list with components:
sequencesAn
stslist object of sequences with discrete observations.
idA vector with ids that identify the sequences in
sequences.
clusterA vector with found cluster memberships for the sequences.
partitionA list object with
the partition, a mixture of HMMs. Each element in the list is an hmm
object.
membershipsA matrix with cluster memberships for each sequence.
n.clustersNumerical, the found number of clusters.
sizesA vector with the number of HMM states for each cluster model.
bicA vector with the BICs for each cluster model.
## Simulated data library(seqHMM) output.labels <- c("H", "T") # HMM 1 states.1 <- c("A", "B", "C") transitions.1 <- matrix(c(0.8,0.1,0.1,0.1,0.8,0.1,0.1,0.1,0.8), nrow = 3) rownames(transitions.1) <- states.1 colnames(transitions.1) <- states.1 emissions.1 <- matrix(c(0.5,0.75,0.25,0.5,0.25,0.75), nrow = 3) rownames(emissions.1) <- states.1 colnames(emissions.1) <- output.labels initials.1 <- c(1/3,1/3,1/3) # HMM 2 states.2 <- c("A", "B") transitions.2 <- matrix(c(0.75,0.25,0.25,0.75), nrow = 2) rownames(transitions.2) <- states.2 colnames(transitions.2) <- states.2 emissions.2 <- matrix(c(0.8,0.6,0.2,0.4), nrow = 2) rownames(emissions.2) <- states.2 colnames(emissions.2) <- output.labels initials.2 <- c(0.5,0.5) # Simulate hmm.sim.1 <- simulate_hmm(n_sequences = 100, initial_probs = initials.1, transition_probs = transitions.1, emission_probs = emissions.1, sequence_length = 25) hmm.sim.2 <- simulate_hmm(n_sequences = 100, initial_probs = initials.2, transition_probs = transitions.2, emission_probs = emissions.2, sequence_length = 25) sequences <- rbind(hmm.sim.1$observations, hmm.sim.2$observations) n <- nrow(sequences) # Clustering algorithm id <- paste0("K-", 1:n) rownames(sequences) <- id sequences <- sequences[sample(1:n, n),] res <- hmm.clust(sequences, id = rownames(sequences)) ############################################################################# ## Swiss Household Data data("biofam", package = "TraMineR") # Clustering algorithm new.alphabet <- c("P", "L", "M", "LM", "C", "LC", "LMC", "D") sequences <- seqdef(biofam[,10:25], alphabet = 0:7, states = new.alphabet) ## Not run: res <- hmm.clust(sequences) # Heatmaps cluster <- 1 # display heatmaps for cluster 1 transition.heatmap(res$partition[[cluster]]$transition_probs, res$partition[[cluster]]$initial_probs) emission.heatmap(res$partition[[cluster]]$emission_probs) ## End(Not run) ## A smaller example, which takes less time to run subset <- sequences[sample(1:nrow(sequences), 20, replace = FALSE),] # Clustering algorithm, limiting number of clusters to 2 res <- hmm.clust(subset, K.max = 2) # Number of clusters print(res$n.clusters) # Table of cluster memberships table(res$memberships[,"cluster"]) # BIC for each number of clusters print(res$bic) # Heatmaps cluster <- 1 # display heatmaps for cluster 1 transition.heatmap(res$partition[[cluster]]$transition_probs, res$partition[[cluster]]$initial_probs) emission.heatmap(res$partition[[cluster]]$emission_probs)## Simulated data library(seqHMM) output.labels <- c("H", "T") # HMM 1 states.1 <- c("A", "B", "C") transitions.1 <- matrix(c(0.8,0.1,0.1,0.1,0.8,0.1,0.1,0.1,0.8), nrow = 3) rownames(transitions.1) <- states.1 colnames(transitions.1) <- states.1 emissions.1 <- matrix(c(0.5,0.75,0.25,0.5,0.25,0.75), nrow = 3) rownames(emissions.1) <- states.1 colnames(emissions.1) <- output.labels initials.1 <- c(1/3,1/3,1/3) # HMM 2 states.2 <- c("A", "B") transitions.2 <- matrix(c(0.75,0.25,0.25,0.75), nrow = 2) rownames(transitions.2) <- states.2 colnames(transitions.2) <- states.2 emissions.2 <- matrix(c(0.8,0.6,0.2,0.4), nrow = 2) rownames(emissions.2) <- states.2 colnames(emissions.2) <- output.labels initials.2 <- c(0.5,0.5) # Simulate hmm.sim.1 <- simulate_hmm(n_sequences = 100, initial_probs = initials.1, transition_probs = transitions.1, emission_probs = emissions.1, sequence_length = 25) hmm.sim.2 <- simulate_hmm(n_sequences = 100, initial_probs = initials.2, transition_probs = transitions.2, emission_probs = emissions.2, sequence_length = 25) sequences <- rbind(hmm.sim.1$observations, hmm.sim.2$observations) n <- nrow(sequences) # Clustering algorithm id <- paste0("K-", 1:n) rownames(sequences) <- id sequences <- sequences[sample(1:n, n),] res <- hmm.clust(sequences, id = rownames(sequences)) ############################################################################# ## Swiss Household Data data("biofam", package = "TraMineR") # Clustering algorithm new.alphabet <- c("P", "L", "M", "LM", "C", "LC", "LMC", "D") sequences <- seqdef(biofam[,10:25], alphabet = 0:7, states = new.alphabet) ## Not run: res <- hmm.clust(sequences) # Heatmaps cluster <- 1 # display heatmaps for cluster 1 transition.heatmap(res$partition[[cluster]]$transition_probs, res$partition[[cluster]]$initial_probs) emission.heatmap(res$partition[[cluster]]$emission_probs) ## End(Not run) ## A smaller example, which takes less time to run subset <- sequences[sample(1:nrow(sequences), 20, replace = FALSE),] # Clustering algorithm, limiting number of clusters to 2 res <- hmm.clust(subset, K.max = 2) # Number of clusters print(res$n.clusters) # Table of cluster memberships table(res$memberships[,"cluster"]) # BIC for each number of clusters print(res$bic) # Heatmaps cluster <- 1 # display heatmaps for cluster 1 transition.heatmap(res$partition[[cluster]]$transition_probs, res$partition[[cluster]]$initial_probs) emission.heatmap(res$partition[[cluster]]$emission_probs)
Get the log likelihood of an HMM object and check if it is feasible (i.e.,
contains no illegal emissions). Auxiliary function used in
partition.bic.
model.ll(hmm)model.ll(hmm)
hmm |
An |
The log likelihood of the hmm object, print warning if model
is infeasible (i.e., if the log likelihood is evaluated for a sequence that
contains emissions that are assigned probability 0 in the hmm
object).
Used in partition.bic.
Compute the BIC of a partition given a threshold epsilon for counting
parameters. Auxiliary function used in hmm.clust.
partition.bic(partition, eps = 0.001)partition.bic(partition, eps = 0.001)
partition |
A list object with the partition of HMMs, a mixture of HMMs. |
eps |
A threshold epsilon for counting parameters in
|
The BIC of the partition.
Used in main function for the DBHC algorithm
hmm.clust.
Seed selection procedure of the DBHC algorithm, also invokes size search
algorithm for seed in size.search. Used in hmm.clust.
select.seeds( sequences, log_space = FALSE, K, seed.size = 3, init.size = 2, print = FALSE, smoothing = 1e-04 )select.seeds( sequences, log_space = FALSE, K, seed.size = 3, init.size = 2, print = FALSE, smoothing = 1e-04 )
sequences |
An |
log_space |
Logical, parameter provided to
|
K |
The number of seeds to select, equal to the number of clusters in a partition. |
seed.size |
Seed size, the number of sequences to be selected for a seed. |
init.size |
The number of HMM states in an initial HMM. |
print |
Logical, whether to print intermediate steps or not. |
smoothing |
Smoothing parameter for absolute discounting in
|
A partition as a list object with HMMs for the selected seeds.
Used in main function for the DBHC algorithm
hmm.clust.
Compute the sequence-to-HMM likelihood of an HMM evaluated for a single
sequence and check if the sequence contains emissions that are not possible
according to the HMM. Auxiliary function used in select.seeds
and assign.clusters.
seq2hmm.ll(hmm)seq2hmm.ll(hmm)
hmm |
An |
The log likelihood of the sequence contained in hmm, value
will be set to minus infinity if the sequence contains illegal emissions.
Used in select.seeds and
assign.clusters.
The size search algorithm finds the optimal number of HMM states for a set of
sequences and returns both the optimal hmm object and the
corresponding number of hidden states. Used in select.seeds.
size.search(sequences, log_space = FALSE, print = FALSE)size.search(sequences, log_space = FALSE, print = FALSE)
sequences |
An |
log_space |
Logical, parameter provided to
|
print |
Logical, whether to print intermediate steps or not. |
A list with the optimal number of HMM states and the optimal
hmm object.
Used in the DBHC seed selection procedure in
select.seeds.
Smooth the parameters of an HMM using absolute discounting given a threshold
epsilon. Auxiliary function used in select.seeds,
assign.clusters, and hmm.clust.
smooth.hmm(hmm, smoothing = 1e-04)smooth.hmm(hmm, smoothing = 1e-04)
hmm |
A raw |
smoothing |
Smoothing parameter for absolute discounting in
|
An hmm object with smoothed probabilities.
Used in select.seeds, assign.clusters,
and main function for the DBHC algorithm hmm.clust.
Smooth a vector of probabilities using absolute discounting. Auxiliary
function used in smooth.hmm.
smooth.probabilities(probs, smoothing = 1e-04)smooth.probabilities(probs, smoothing = 1e-04)
probs |
A vector of raw probabilities. |
smoothing |
Smoothing parameter for absolute discounting. |
A vector of smoothed probabilities.
Used in smooth.hmm.
Plots a heatmap of an HMM's initial and transition probabilities.
transition.heatmap(transition, initial = NULL, base_size = 10)transition.heatmap(transition, initial = NULL, base_size = 10)
transition |
A matrix with transition probabilities (see also
|
initial |
An (optional) vector of initial probabilities. |
base_size |
Numerical, a size parameter for the plots made using |
See hmm.clust for an example.