DGA Detection: Entropy, N-gram Analysis, and Why Dictionaries Matter


Domain generation algorithms (DGAs) produce domains that look like garbage to human eyes — strings of random-appearing characters that resolve to temporary C2 infrastructure. The obvious detection approach is to measure how "random" a domain name looks and flag anything that exceeds a randomness threshold. This approach works against the naive DGAs of 2009. Modern DGAs use dictionary-based generation, linguistic models, and pronounceable syllable patterns specifically to defeat entropy-based detection. The signal has shifted, and so must the detection methodology.

A Brief Taxonomy of DGA Implementations

DGA implementations fall into four categories based on their generation algorithm, each with different detection signatures:

Character-based DGAs (Type I): Generate domain names by pseudo-randomly selecting characters from a fixed alphabet, typically producing strings like xkqvbrtwz.com. These are the oldest and most detectable type: they show high Shannon entropy, a poor match against natural-language n-gram frequencies, and low consonant-vowel alternation regularity. Entropy thresholding works reliably against Type I DGAs but has limited value as a standalone technique, because Type I is increasingly rare in current malware families.

Syllable-based DGAs (Type II): Generate domain names by concatenating pronounceable syllables, producing strings like mobaretipux.com. These domains have lower character entropy than Type I because they alternate vowels and consonants with more regularity. Entropy thresholding alone misses many Type II DGAs. Bigram frequency analysis comparing the generated domain against a model trained on legitimate registered domains provides a stronger signal: syllable-based DGAs produce bigram frequencies that are intermediate between random strings and natural language, and distinguishable from both with a well-calibrated model.

Dictionary-based DGAs (Type III): Generate domain names by combining words from an embedded dictionary, producing strings like cloudriverstone.com or bluegatebridge.net. These domains have low character entropy, reasonable n-gram frequency, and plausible linguistic structure — specifically designed to defeat entropy-based detection. Detection requires different signals: domain age (newly registered dictionary-combination domains), DNS resolution failure rates (DGA domains are registered only for a subset of generated names, so most resolve to NXDOMAIN), and TLD selection patterns (DGA-generated dictionary domains show different TLD preferences than legitimate commercial domains).

Wordlist-seeded DGAs (Type IV): Generate domain names by applying a hash or cipher transformation to a combination of words and a date-based seed value. The output is structurally similar to Type III but changes completely on each seeding period. Detection requires identifying the wordlist and algorithm to predict future domains, or relying on behavioral patterns (NXDOMAIN volume, DNS query timing) rather than domain name structure analysis.
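
The consonant-vowel alternation regularity mentioned for Types I and II can be turned into a simple numeric feature. The following is a minimal sketch, not a production feature extractor: the function name alternation_ratio and the choice to treat only a, e, i, o, u as vowels are assumptions made for illustration.

```python
VOWELS = set("aeiou")


def alternation_ratio(label: str) -> float:
    """Fraction of adjacent letter pairs that switch between vowel and consonant.

    Non-letter characters (digits, hyphens) are ignored. Returns 0.0 for
    labels with fewer than two letters.
    """
    letters = [c for c in label.lower() if c.isalpha()]
    if len(letters) < 2:
        return 0.0
    switches = sum(
        (a in VOWELS) != (b in VOWELS) for a, b in zip(letters, letters[1:])
    )
    return switches / (len(letters) - 1)


for label in ("xkqvbrtwz", "mobaretipux", "cloudriverstone"):
    print(label, round(alternation_ratio(label), 2))
```

On the example labels from the taxonomy, the all-consonant Type I string scores 0.0 and the syllable-based string scores 1.0, while the dictionary-based string falls in between. A perfectly regular alternation is itself unusual for natural words, so this feature is best read alongside the other signals rather than thresholded alone.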

Why Entropy Alone Is Insufficient

Shannon entropy of a domain name — computed over the character distribution — provides a single number that measures character randomness. A domain with entropy above approximately 3.8 bits per character (on a 26-character lowercase alphabet) is more likely to be algorithmically generated than human-registered. This threshold works for Type I DGAs and fails for Types II, III, and IV.
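
As an illustration of the measurement itself, here is a minimal sketch of empirical Shannon entropy over a domain label. The helper second_level_label is a hypothetical, deliberately naive extractor that assumes a single-label TLD; real pipelines would need a public-suffix list.

```python
import math
from collections import Counter


def shannon_entropy(label: str) -> float:
    """Empirical Shannon entropy of a string, in bits per character."""
    if not label:
        return 0.0
    total = len(label)
    counts = Counter(label)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def second_level_label(domain: str) -> str:
    """Naive extraction of the label directly left of the TLD.

    Assumption: the TLD is a single label. Suffixes such as co.uk are not
    handled by this sketch.
    """
    parts = domain.lower().rstrip(".").split(".")
    return parts[-2] if len(parts) >= 2 else parts[0]


for name in ("xkqvbrtwz.com", "cloudriverstone.com"):
    label = second_level_label(name)
    print(f"{name}: {shannon_entropy(label):.2f} bits/char")
```

For these two examples the sketch prints roughly 3.17 and 3.51 bits per character. The random-looking label scores lower because empirical entropy of an n-character string cannot exceed log2(n), so a 9-character label tops out near 3.17 no matter how random it is. Any entropy threshold therefore has to be read with label length in mind.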

Entropy thresholding also has a false-positive problem. Many legitimate hostnames have high character entropy: software version identifiers embedded in hostnames (v17a3z.cdn.example.net), UUID-based subdomains generated by CDN providers, and automatically generated service discovery subdomains. Applying a strict entropy threshold in a large enterprise environment produces enough false positives to overwhelm the true positive detections.

A more informative signal than raw entropy is deviation from language-specific n-gram frequency distributions. English domain names, even technical ones, tend to follow the character-pair (bigram) and character-triple (trigram) frequency distributions characteristic of English text. A domain like cloudriverstone.com has plausible English bigram frequencies; a domain like xkqvbrtwz.com does not. A model trained on the bigram distributions of legitimate domains flags the latter with high confidence and the former with low confidence, even if their entropy values are similar.
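
A bigram model of this kind can be sketched in a few lines. This is an illustration under stated assumptions, not a tuned detector: the class name BigramModel, the add-one smoothing, and the ten-item TRAINING_LABELS list are all hypothetical. A usable model would be trained on a large corpus of legitimate registered domains.

```python
import math
from collections import Counter

# Characters permitted in a DNS label; used only to size the smoothing term.
ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789-"

# Hypothetical toy corpus; far too small for real use.
TRAINING_LABELS = [
    "google", "cloudflare", "riverbank", "stonebridge", "wikipedia",
    "microsoft", "github", "mountainview", "bluewater", "netflix",
]


def bigrams(label: str) -> list[str]:
    """All overlapping two-character substrings of the label."""
    return [label[i:i + 2] for i in range(len(label) - 1)]


class BigramModel:
    def __init__(self, training_labels):
        self.pair_counts = Counter()   # count of each bigram
        self.first_counts = Counter()  # count of each bigram's first character
        for label in training_labels:
            for bg in bigrams(label.lower()):
                self.pair_counts[bg] += 1
                self.first_counts[bg[0]] += 1

    def avg_log_prob(self, label: str) -> float:
        """Mean log2 P(next char | current char), with add-one smoothing.

        Values closer to zero mean the label looks more like the training data.
        """
        grams = bigrams(label.lower())
        if not grams:
            return 0.0
        vocab = len(ALPHABET)
        total = 0.0
        for bg in grams:
            p = (self.pair_counts[bg] + 1) / (self.first_counts[bg[0]] + vocab)
            total += math.log2(p)
        return total / len(grams)


model = BigramModel(TRAINING_LABELS)
for label in ("cloudriverstone", "xkqvbrtwz"):
    print(label, round(model.avg_log_prob(label), 2))
```

Even with this toy corpus, the dictionary-style label scores closer to zero than the random one, because most of its bigrams were seen in training. That is also why the same model gives little help against Type III domains: they are built to look like the training data.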

The NXDOMAIN Volume Signal

One of the most reliable behavioral signals for DGA activity is NXDOMAIN response volume from individual hosts. DGA malware queries the full set of generated domain names for the current period, attempting to find the one or few names that the operator has registered. The unregistered names return NXDOMAIN. A single infected host may generate 50-200 NXDOMAIN responses within a 5-minute window before finding the active C2 domain — a volume pattern that is highly anomalous relative to normal application DNS behavior.

The caveat is that some legitimate applications also generate high NXDOMAIN volumes: applications that probe for optional services at startup, misconfigured DNS search suffixes that append incorrect domains to internal hostnames, and browser auto-complete behavior. Baselining NXDOMAIN volume per host and flagging deviations above 3 standard deviations from the host's historical baseline reduces false positives significantly compared to applying an absolute threshold.
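
The per-host baseline check can be sketched as follows. The function names and the representation of history as a list of NXDOMAIN counts per fixed time window are assumptions for illustration; the 3.0 default mirrors the three-standard-deviation rule described above.

```python
from statistics import mean, stdev


def nxdomain_zscore(history: list[int], current: int):
    """Standard deviations by which `current` exceeds the host's baseline.

    `history` holds the host's NXDOMAIN counts for earlier windows of the
    same length. Returns None when there is too little history to estimate
    a standard deviation.
    """
    if len(history) < 2:
        return None
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # Perfectly flat baseline: treat any increase as unbounded deviation.
        return float("inf") if current > mu else 0.0
    return (current - mu) / sigma


def is_anomalous(history: list[int], current: int, threshold: float = 3.0) -> bool:
    """True when the current window exceeds the baseline by more than `threshold` sigmas."""
    z = nxdomain_zscore(history, current)
    return z is not None and z > threshold


baseline = [2, 3, 1, 4, 2, 3, 2, 5]  # hypothetical counts per 5-minute window
print(is_anomalous(baseline, 4))     # within normal variation
print(is_anomalous(baseline, 80))    # burst far above baseline
```

A host with a naturally noisy baseline, such as one with a misconfigured search suffix, has a large standard deviation and so needs a much larger burst to trip the check, which is the false-positive reduction the per-host approach is meant to provide.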

The combination of NXDOMAIN volume anomaly with a domain name structure analysis score produces a higher-precision signal than either method alone. A host generating 80 NXDOMAIN responses in 5 minutes, where 60 of those domains score above the structure-analysis threshold, is a high-confidence DGA indicator. A host generating 80 NXDOMAIN responses where all domains are legitimate-looking subdomains is more likely to be a misconfigured application than a DGA infection.
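
One way to express this combination is a small triage function over the NXDOMAIN names seen from one host in one window. Everything here is a sketch under assumptions: min_count=50 is taken from the low end of the 50-200 range mentioned earlier, the 0.5 suspicious fraction is an arbitrary illustrative cutoff, and toy_score is a hypothetical stand-in for a real structure scorer such as an n-gram model.

```python
from typing import Callable, Iterable


def classify_nxdomain_burst(
    domains: Iterable[str],
    score_fn: Callable[[str], float],
    score_threshold: float,
    min_count: int = 50,
    min_suspicious_fraction: float = 0.5,
) -> str:
    """Triage one host's NXDOMAIN names from a single time window."""
    names = list(domains)
    if len(names) < min_count:
        return "below-volume-threshold"
    suspicious = sum(1 for d in names if score_fn(d) > score_threshold)
    if suspicious / len(names) >= min_suspicious_fraction:
        return "likely-dga"
    return "likely-misconfiguration"


def toy_score(domain: str) -> float:
    """Stand-in scorer: 1.0 if the first label has no vowels, else 0.0."""
    label = domain.split(".")[0]
    return 0.0 if any(c in "aeiou" for c in label) else 1.0


# 60 random-looking names plus 20 ordinary internal lookups, as in the text.
burst = [f"xkqvbrtwz{i}.com" for i in range(60)]
burst += [f"printer{i}.corp.example" for i in range(20)]
print(classify_nxdomain_burst(burst, toy_score, score_threshold=0.5))

# 80 ordinary-looking internal names that all failed to resolve.
misconfig = [f"host{i}.corp.example" for i in range(80)]
print(classify_nxdomain_burst(misconfig, toy_score, score_threshold=0.5))
```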

Dictionary DGA Detection: The Hard Problem

Dictionary-based DGAs are the current state-of-the-art evasion technique specifically because they defeat the detection approaches that the security industry has standardized on. A domain like mountainbridgecoin.com — three common English words combined — has low entropy, reasonable n-gram statistics, and plausible linguistic structure. It appears identical to a legitimately registered English-language domain to a detection model that looks only at domain name structure.

The practical detection approach for dictionary DGAs relies on signals other than name structure: domain registration age from WHOIS records, passive DNS resolution history, and first appearance in certificate transparency logs. A newly registered dictionary-combination domain (registered within the past 24 hours) that immediately appears in a DNS query from a host with other recent suspicious enrichment hits warrants escalation regardless of whether the domain name looks plausible. The combination of novelty and suspicious context is more diagnostic than any structural feature of the domain name itself.
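
The escalation rule in this paragraph can be written down as a small predicate. This is a sketch only: the DomainObservation record and its field names are hypothetical, and it assumes the registration timestamp and the host context have already been gathered from whatever WHOIS and enrichment sources are in use.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class DomainObservation:
    domain: str
    registered_at: datetime     # registration time, timezone-aware
    queried_at: datetime        # time the host queried the domain
    host_has_other_hits: bool   # host has other recent suspicious hits


def should_escalate(
    obs: DomainObservation, max_age: timedelta = timedelta(hours=24)
) -> bool:
    """Escalate when a very young domain is queried by an already suspicious host.

    The domain name itself is deliberately not inspected.
    """
    age = obs.queried_at - obs.registered_at
    is_new = timedelta(0) <= age <= max_age
    return is_new and obs.host_has_other_hits


now = datetime(2024, 1, 10, 12, 0, tzinfo=timezone.utc)
obs = DomainObservation(
    domain="mountainbridgecoin.com",
    registered_at=now - timedelta(hours=6),
    queried_at=now,
    host_has_other_hits=True,
)
print(should_escalate(obs))
```

Requiring both conditions keeps the rule from firing on every new domain a clean host happens to visit, which reflects the point that novelty alone is not diagnostic.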

ThreatPulsar integrates certificate transparency log data and passive DNS registration history into its domain enrichment results. When a domain returns as newly registered (under 14 days), recently appearing in CT logs, and queried from a host with other active enrichment hits, the enrichment response includes a DGA hypothesis flag with the supporting evidence. This flag is a soft indicator — not a confident DGA verdict — that prompts analyst review rather than automated response.

Connecting DGA Detection to C2 Infrastructure Analysis

DGA detection is a specific application of the broader C2 infrastructure identification methodology described in our article on clustering threat actor infrastructure. Once a DGA family is identified from NXDOMAIN volume and domain structure analysis, the registered domains from that generation period can be submitted for enrichment to retrieve the full cluster context: what C2 infrastructure is actually hosting the registered domains, what certificate characteristics those IPs present, and whether the hosting infrastructure has been previously associated with known threat actor groups.

The DGA detection outcome that is most operationally useful is not the identification of the malicious domains themselves — those are transient and will be rotated within 24-48 hours — but the identification of the C2 hosting infrastructure that remains active longer than any single domain generation period. Blocking the hosting IP rather than just the domain provides more durable containment value.
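
Finding the longer-lived hosting layer amounts to pivoting from domains to the IPs they resolve to. The sketch below assumes resolution data is already available as a mapping from domain to a set of IPs (for example from passive DNS); the function name, the min_domains cutoff, and the sample data using documentation address ranges are all hypothetical.

```python
from collections import defaultdict


def shared_hosting_ips(
    resolutions: dict[str, set[str]], min_domains: int = 2
) -> dict[str, list[str]]:
    """Return IPs that host at least `min_domains` of the given domains."""
    by_ip: dict[str, set[str]] = defaultdict(set)
    for domain, ips in resolutions.items():
        for ip in ips:
            by_ip[ip].add(domain)
    return {
        ip: sorted(domains)
        for ip, domains in by_ip.items()
        if len(domains) >= min_domains
    }


# Hypothetical resolutions for registered domains from one generation period.
observed = {
    "cloudriverstone.com": {"192.0.2.10"},
    "bluegatebridge.net": {"192.0.2.10", "198.51.100.7"},
    "mountainbridgecoin.com": {"192.0.2.10"},
}
for ip, domains in shared_hosting_ips(observed).items():
    print(ip, domains)
```

An IP that recurs across several rotated domains is a candidate for the more durable block described above. Candidates should still be checked against shared hosting and CDN ranges before blocking, since an address serving many unrelated tenants would cause collateral damage.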

Conclusion

DGA detection requires a layered approach that matches the sophistication of current generation algorithms. Entropy thresholding handles Type I DGAs and should remain part of the detection stack for broad coverage. N-gram frequency modeling handles Type II DGAs. Behavioral signals (NXDOMAIN volume, DNS timing patterns) combined with domain lifecycle analysis (registration age, CT log presence) are required for Type III and IV DGAs where name structure analysis is insufficient.

The detection investment should focus on the behavioral and lifecycle signals, because those signals are structurally harder to defeat than name structure evasion. A threat actor can trivially switch from character-based to dictionary-based DGA generation to defeat entropy thresholding. They cannot as easily prevent their freshly registered domains from being young, their infrastructure from showing up in CT logs, or their infected hosts from generating elevated NXDOMAIN volumes during the generation period.
