Lookalike domains are internet domain names deliberately crafted to look very similar to a legitimate, trusted domain, often with the goal of deceiving users. Examples include “tacobe11.com” versus “tacobell.com”, “login-outlook.com” versus “login.outlook.com” and “infoblocks.com” versus “infoblox.com”. They are a common tactic across many types of internet threats, including scams, credential harvesting, phishing attacks and even innocuous-looking command-and-control (C2) domains.
Infoblox uses many algorithms to detect lookalikes. One challenge in finding lookalikes is compiling a list of target domains that could be used against our customers. We can compile these using various methods, such as creating top queried domain lists, soliciting customer input or even evaluating non-resolved domains for common typos.
However, such a list will be far from comprehensive, as less frequently queried domains may be left off it. In addition, evaluating each new domain against the entire, ever-growing target set slows down processing.
Recently, data scientists at Infoblox Threat Intel decided to prompt a large language model (LLM) to determine the likelihood that a given domain is a lookalike and, if so, which domains are its most likely targets. With frontier-level LLMs, the results for popular domains were very accurate for both intentional lookalikes and non-lookalikes, generally 91 percent or greater. However, Infoblox takes false positives seriously, as they can cause outages and overload analysts with more alerts.
As we investigated further, particularly when identifying additional lookalikes at the tail end of the possible variations, we found the process to be prone to errors. In fact, target lists could become completely inaccurate. For instance, the benign domain “netgeek.com” was incorrectly considered a lookalike of “netgear.com” or “geeksquad.com”. Careful prompt engineering, including chain-of-thought and reflection techniques, helped address these issues. Sometimes, though, the best way to tame hallucination is to go back to traditional methods.
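To make the idea concrete, here is a minimal sketch of what prompting for a lookalike verdict and then sanity-checking the reply might look like. The prompt wording, the JSON field names and the helper functions `build_prompt` and `parse_reply` are all assumptions made for illustration; they are not the prompts or schema used by Infoblox, and no particular LLM API is implied.

```python
import json
from typing import Optional

# Hypothetical prompt template; the wording and JSON schema are assumptions.
PROMPT_TEMPLATE = (
    "You are a DNS security analyst. Decide whether the domain below is a "
    "lookalike of a well-known domain.\n"
    "Domain: {domain}\n"
    'Reply with JSON only: {{"is_lookalike": true|false, '
    '"targets": ["example.com"], "confidence": 0-10, "explanation": "..."}}'
)


def build_prompt(domain: str) -> str:
    """Fill the template for one candidate domain (normalized to lowercase)."""
    return PROMPT_TEMPLATE.format(domain=domain.strip().lower())


def parse_reply(reply: str) -> Optional[dict]:
    """Parse a model reply; return None if it is malformed or out of range."""
    try:
        data = json.loads(reply)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict) or not isinstance(data.get("is_lookalike"), bool):
        return None
    targets = data.get("targets", [])
    if not isinstance(targets, list) or not all(isinstance(t, str) for t in targets):
        return None
    confidence = data.get("confidence")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
        return None
    if not 0 <= confidence <= 10:
        return None
    return {
        "is_lookalike": data["is_lookalike"],
        "targets": [t.strip().lower() for t in targets],
        "confidence": confidence,
        "explanation": str(data.get("explanation", "")),
    }
```

Rejecting malformed replies outright, rather than trying to repair them, keeps unusable model output from flowing into later checks.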
Here are the additional steps Infoblox takes to reduce false positives and improve lookalike domain detection:
Step 1
We use our existing algorithms to check whether “netgeek.com” is a lookalike of either LLM-suggested target, “netgear.com” or “geeksquad.com”. Our modified edit distance algorithm did quite well at disambiguating “netgeek.com” from both targets. But consider another benign example, “amzn.com”. The LLM helpfully suggested “amazon.com”, and here our modified edit distance algorithm still scored “amzn.com” close to “amazon.com”. This would be a problem if we didn’t consider the additional context that “amzn.com” is owned by Amazon and, as such, isn’t something we need to alert on or block.
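The modifications Infoblox makes to edit distance are not described in this post, so the sketch below uses the plain textbook Levenshtein distance as a stand-in. The function names and the length-normalized `similarity` score are assumptions for illustration, and scores from this sketch will not match those of the production algorithm.

```python
def edit_distance(a: str, b: str) -> int:
    """Classic Levenshtein distance: insert, delete and substitute all cost 1."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Map distance to a 0..1 score, normalized by the longer string."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


def rank_targets(candidate: str, targets: list) -> list:
    """Order LLM-proposed targets by string similarity, most similar first."""
    scored = [(target, similarity(candidate, target)) for target in targets]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Compare the name labels only (TLD stripped by hand for this example).
for target, score in rank_targets("netgeek", ["geeksquad", "netgear"]):
    print(f"netgeek vs {target}: {score:.2f}")
```

With this unmodified distance, “netgeek” and “netgear” are two edits apart, and so are “amzn” and “amazon”. Plain Levenshtein therefore cannot separate these cases on its own, which is one reason a tuned variant and the ownership check in the next step are needed.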
Step 2
We can look up the registration and SSL/TLS certificate information for both “amzn.com” and “amazon.com”. Here we see that “amzn.com” and “amazon.com” share the same name servers, the same registrar and a common SSL certificate, so we can be confident that “amzn.com” is owned by the same entity as “amazon.com”. This means that while “amzn.com” does indeed look like “amazon.com”, there is no need to report the domain.
Infoblox is committed to responsible AI. Part of this commitment is not taking generated data at face value. By employing these secondary checks, we trust but verify all predictions.
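The comparison in this step can be sketched as an overlap check on records that have already been collected. The `DomainRecord` type, the field names, the `min_signals` parameter and every value in the example (the `.example` domains, name servers, registrar and fingerprint) are hypothetical placeholders, not real registration data and not Infoblox's implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DomainRecord:
    """Pre-fetched registration and certificate facts for one domain."""
    domain: str
    name_servers: frozenset
    registrar: str
    cert_fingerprints: frozenset


def shared_ownership_signals(a: DomainRecord, b: DomainRecord) -> list:
    """List which ownership signals the two domains have in common."""
    signals = []
    if a.name_servers & b.name_servers:
        signals.append("name_servers")
    if a.registrar and a.registrar.casefold() == b.registrar.casefold():
        signals.append("registrar")
    if a.cert_fingerprints & b.cert_fingerprints:
        signals.append("certificate")
    return signals


def likely_same_owner(a: DomainRecord, b: DomainRecord, min_signals: int = 3) -> bool:
    """Treat the pair as co-owned only when enough signals agree."""
    return len(shared_ownership_signals(a, b)) >= min_signals


# Placeholder records for illustration only.
brand = DomainRecord(
    "brand.example",
    frozenset({"ns1.dns.example", "ns2.dns.example"}),
    "Example Registrar",
    frozenset({"placeholder-fingerprint-1"}),
)
short = DomainRecord(
    "brnd.example",
    frozenset({"ns1.dns.example"}),
    "example registrar",
    frozenset({"placeholder-fingerprint-1"}),
)
print(shared_ownership_signals(brand, short), likely_same_owner(brand, short))
```

The default here requires all three signals to agree, mirroring the “amzn.com” case above. A shared registrar on its own is weak evidence, since many unrelated domains use the same registrar.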
Recent Examples
So far, we have discussed how we avoid hallucinations that could cause a domain to be misclassified as a lookalike. Now, let’s see how this helps defend against various lookalike threats. The following table gives some recent examples of lookalikes we’ve found, along with the explanations the LLM generated.
| Domain | Target | Confidence Score (out of 10) | Explanation |
|---|---|---|---|
| infobiox.com | infoblox.com | 8 | This may be a suffix-based lookalike, attempting to get a user to try to log in to an Infoblox portal, but the “x” instead of “l” could be a typo or a legitimate variation. |
| paioaltonetworks.tech | paloaltonetworks.com | 6 | The domain “paioaltonetworks.tech” appears to be a typo or substitution attempt, with “paio” being similar to “palo”, which is a well-known cybersecurity company. However, the use of “.tech” instead of “.com” and the slight misspelling may indicate a legitimate domain or a non-malicious typo. |
| icloudservi.me | icloud.com | 9 | This domain appears to be a lookalike of “icloud.com” with a typo (“servi” instead of “service”) and a different top-level domain (“.me”), which might be an attempt to deceive users into thinking it’s related to Apple’s iCloud services. |
| maps-iphone.cloud | apple.com | 9 | The domain “maps-iphone.cloud” could be a lookalike of Apple’s services by combining “maps” (a service offered by Apple) with “iphone” (a product by Apple) and a trendy top-level domain (“.cloud”), potentially to target iPhone users. |
| goóglé.com | google.com | 9 | The domain contains Unicode characters (“ó” and “é”) that are visually similar to the characters in the well-known domain “google.com”, which could be an attempt to deceive users. |
| onedrive-micrasoft.com | microsoft.com | 8 | This may be a suffix-based lookalike, with a typographical error in “micrasoft” instead of “microsoft”, which is a well-known company and product. |
| coinbase-invoice.com | coinbase.com | 8 | This may be a suffix-based lookalike, attempting to deceive users into thinking it’s an official invoice from Coinbase, a well-known cryptocurrency exchange. |
| login-wellsfargo.com | wellsfargo.com | 9 | This may be a prefix-based lookalike, attempting to get a user to try to log in to a fake Wells Fargo portal. |
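The “goóglé.com” row relies on accented characters, a case that a simple traditional check can also catch. The sketch below, using only Python's standard library, strips combining marks after Unicode decomposition and shows the ASCII (Punycode) form that such a name takes in DNS. The function names are assumptions for illustration. This approach handles accents only; it does not cover cross-script homoglyphs such as a Cyrillic “а”, which need a confusables mapping.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Decompose characters (NFKD) and drop the combining accent marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def is_accent_lookalike(candidate: str, target: str) -> bool:
    """True if the candidate differs from the target only by diacritics."""
    candidate, target = candidate.lower(), target.lower()
    return candidate != target and strip_diacritics(candidate) == target


def to_ascii(domain: str) -> str:
    """Return the ASCII-compatible form using Python's built-in IDNA codec."""
    return domain.encode("idna").decode("ascii")


print(strip_diacritics("goóglé.com"))
print(is_accent_lookalike("goóglé.com", "google.com"))
print(to_ascii("goóglé.com"))
```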
Recently, Infoblox Threat Intel conducted a study on the risks related to parked domains and used these methods to better understand which legitimate domains the actors were impersonating.
Lessons Learned
It’s interesting to note that the LLM suggested “apple.com” as the target for “maps-iphone.cloud”. This shows that the LLM can detect relationships that string similarity techniques can miss. It also points to a potential limitation of using our traditional string similarity techniques to mitigate hallucination, since they cannot confirm a target like this one. Here, we are starting to consider the confidence score along with registration and SSL certificate information and, in some cases, content.
However, one must be careful about trusting the confidence predictions of LLMs, as they are token predictors and not regressors. When using sufficient temperature to nominate targets, the confidence score can often vary by up to 3 points on a 10-point scale. Additionally, we note that while the model correctly identified the target in all of these cases, the explanations can be slightly wrong. For example, the first entry shows a clear hallucination. The model’s explanation states that an “x” was substituted for an “l”, when in fact the lookalike domain “infobiox.com” demonstrates an “l” → “i” substitution (“infoblox.com” → “infobiox.com”). While this is further evidence that results need to be verified, it also shows how a wrong model can still be useful if properly constrained.
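One way to cope with this run-to-run variation is to sample the model several times and summarize the scores rather than trust any single value. The sketch below is an assumption about how that could be done, not a description of Infoblox's pipeline; the function name and the `max_spread` default of 3, chosen to match the variation noted above, are hypothetical.

```python
from statistics import median


def summarize_confidence(scores: list, max_spread: float = 3) -> dict:
    """Summarize repeated 0-10 confidence scores for one domain/target pair."""
    if not scores:
        raise ValueError("at least one score is required")
    spread = max(scores) - min(scores)
    return {
        "median": median(scores),        # robust to one outlying sample
        "spread": spread,                # how far the samples disagree
        "stable": spread <= max_spread,  # within the expected variation
    }


print(summarize_confidence([8, 9, 6]))
print(summarize_confidence([9, 4, 8]))
```

A pair whose scores disagree by more than the expected variation could be routed to the registration and certificate checks, or to an analyst, instead of being accepted on the strength of one sample.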
Why It Matters
How does this help protect Infoblox customers? We have now figured out how to scale these LLM queries to all newly registered domains seen each day. This allows us to protect Infoblox customers against more types of lookalike attacks, including many involving domains our customers may not consider part of their risk profile. We will start by adding this capability to Zero Day DNS detection and to the Infoblox Threat Defense feature called “Dossier”, and then add processes for including these domains in our feeds.

