U-shaped distribution of topic intensity in the latent Dirichlet allocation model: distribution density function and parameter identification method

Abstract

Konnikov E.A.

Date received: 26.06.2025

This article describes and mathematically justifies the U-shaped distribution of topic shares that arises in the latent Dirichlet allocation (LDA) model with symmetric hyperparameters. It shows that the bimodal shape results from the reduction of the Dirichlet vector to a beta distribution, which makes traditional unimodal approximations inappropriate. A composite probability model is proposed that combines beta, gamma, and Poisson components with a covariate accounting for semantic connectivity. The model parameters are identified by differential evolution, using a criterion that combines the Wasserstein distance with the Jensen–Shannon and Kullback–Leibler divergences. On a corpus of texts from the information field of the Rosatom State Corporation, the new model proves more accurate than lognormal, Pareto, exponential, and normal approximations. This allows thematic flows to be characterized reliably and supports decision-making in large-scale text monitoring tasks.

Keywords: system analysis, latent Dirichlet allocation, topic modeling, Dirichlet latent distribution, topic signal intensity, beta distribution, gamma distribution, Poisson process, Jensen–Shannon divergence, Wasserstein distance, Kullback–Leibler divergence
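
The two technical ideas in the abstract can be illustrated with a minimal sketch. It is not the article's composite model: it fits a single beta component only, omits the gamma and Poisson components and the semantic covariate, and weights the three distance terms equally, which is an assumption. The values of `K`, `alpha`, `n_docs`, the 40-bin histogram, and the parameter bounds are hypothetical. The sketch relies on a standard fact: one coordinate of a symmetric Dirichlet(alpha, ..., alpha) vector over K topics follows Beta(alpha, (K-1)*alpha), whose density is U-shaped when both parameters are below 1.

```python
import numpy as np
from scipy import stats
from scipy.optimize import differential_evolution
from scipy.spatial.distance import jensenshannon

rng = np.random.default_rng(0)

# Hypothetical settings (not from the article): K topics, symmetric alpha < 1.
K, alpha, n_docs = 3, 0.3, 5000

# Draw per-document topic mixtures and keep the share of a single topic.
theta = rng.dirichlet(np.full(K, alpha), size=n_docs)
share = theta[:, 0]

# Check the reduction: the marginal should match Beta(alpha, (K-1)*alpha).
ks_stat = stats.kstest(share, "beta", args=(alpha, (K - 1) * alpha)).statistic

# Empirical bin probabilities on [0, 1].
bins = np.linspace(0.0, 1.0, 41)
centers = 0.5 * (bins[:-1] + bins[1:])
counts, _ = np.histogram(share, bins=bins)
p_emp = counts / counts.sum()


def model_probs(a, b):
    """Bin probabilities of Beta(a, b), floored to keep the KL term finite."""
    q = np.diff(stats.beta.cdf(bins, a, b))
    q = np.clip(q, 1e-12, None)
    return q / q.sum()


def criterion(params):
    """Equal-weight sum of Wasserstein, Jensen-Shannon and KL terms (assumed weights)."""
    a, b = params
    q = model_probs(a, b)
    w = stats.wasserstein_distance(centers, centers, u_weights=p_emp, v_weights=q)
    js = jensenshannon(p_emp, q) ** 2  # jensenshannon returns the square root
    kl = stats.entropy(p_emp, q)       # KL(p_emp || q)
    return w + js + kl


# Global search over both beta parameters with differential evolution.
result = differential_evolution(criterion, bounds=[(0.05, 5.0), (0.05, 5.0)], seed=0)
a_hat, b_hat = result.x

print(f"KS statistic vs Beta(alpha, (K-1)alpha): {ks_stat:.4f}")
print(f"fitted a = {a_hat:.3f}, b = {b_hat:.3f}, criterion = {result.fun:.5f}")
```

The fitted parameters should land near `alpha` and `(K - 1) * alpha`. Replacing `model_probs` with the bin probabilities of a lognormal or normal density gives a quick way to see how a unimodal approximation scores on the same criterion.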