ShamFinder: An Automated Framework for Detecting IDN Homographs

Authors:
Hiroaki Suzuki (Waseda University, Tokyo, Japan)
Daiki Chiba (NTT Secure Platform Laboratories, Tokyo, Japan)
Yoshiro Yoneya (Japan Registry Services, Tokyo, Japan)
Tatsuya Mori (Waseda University/NICT/RIKEN AIP, Tokyo, Japan)
Shigeki Goto (Waseda University, Tokyo, Japan)

IMC '19: Proceedings of the Internet Measurement Conference, October 2019, Pages 449–462. https://doi.org/10.1145/3355369.3355587

Published: 21 October 2019

ABSTRACT

The internationalized domain name (IDN) is a mechanism that allows Unicode characters to be used in domain names. Unicode contains several pairs of characters that are visually identical to each other; e.g., the Latin character 'a' (U+0061) and the Cyrillic character 'а' (U+0430). Such visually identical characters are generally known as homoglyphs. IDN homograph attacks abuse Unicode homoglyphs to create lookalike URLs. Although the threat posed by these attacks is not new, the recent rise of IDN adoption by both domain name registries and web browsers has made them increasingly widespread, leading to large-scale phishing attacks such as those targeting cryptocurrency exchange companies. In this work, we develop a framework named "ShamFinder," an automated scheme for detecting IDN homographs. Our key contribution is the automatic construction of a homoglyph database, which can be used for direct countermeasures against the attack and to inform users about the context of an IDN homograph. Using the ShamFinder framework, we perform a large-scale measurement study that aims to understand the IDN homographs that exist in the wild. On the basis of our approach, we provide insights into an effective countermeasure against the threats caused by the IDN homograph attack.
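To make the homoglyph idea in the abstract concrete, the following is a minimal Python sketch, not the ShamFinder implementation. It shows how a label that swaps Latin 'a' (U+0061) for Cyrillic 'а' (U+0430) differs at the code-point level, what its ASCII-compatible (Punycode) form looks like, and how a homoglyph table could be used to match it against a reference list. The `HOMOGLYPHS` table, the `references` set, and the helper names `to_ascii`, `skeleton`, and `find_homograph_target` are all hypothetical and chosen for illustration only; the paper's database is constructed automatically and is far larger.

```python
import unicodedata

# Hypothetical, hand-picked homoglyph table (non-Latin code point -> Latin
# look-alike). Illustration only; not the database built by the paper.
HOMOGLYPHS = {
    "\u0430": "a",  # CYRILLIC SMALL LETTER A
    "\u0435": "e",  # CYRILLIC SMALL LETTER IE
    "\u043e": "o",  # CYRILLIC SMALL LETTER O
    "\u0440": "p",  # CYRILLIC SMALL LETTER ER
}


def to_ascii(label: str) -> str:
    """Encode a single DNS label into its ASCII-compatible (Punycode) form."""
    if label.isascii():
        return label
    return "xn--" + label.encode("punycode").decode("ascii")


def skeleton(label: str) -> str:
    """Replace every known homoglyph with its Latin look-alike."""
    return "".join(HOMOGLYPHS.get(ch, ch) for ch in label)


def find_homograph_target(label: str, references: set) -> "str | None":
    """Return the reference label that `label` imitates, or None."""
    if label in references:
        return None  # the genuine name itself is not a homograph
    candidate = skeleton(label)
    return candidate if candidate in references else None


# Hypothetical reference list of labels to protect.
references = {"apple", "google"}

# Looks like "apple", but the first character is Cyrillic U+0430.
homograph = "\u0430pple"

for ch in homograph[:2]:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
print(to_ascii(homograph))
print(find_homograph_target(homograph, references))
```

A plain substitution table like this only catches characters it already lists, which is why the coverage of the homoglyph database matters for detection.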


Index Terms

1. Security and privacy
  1. Network security

Published in

IMC '19: Proceedings of the Internet Measurement Conference
October 2019
497 pages
ISBN: 9781450369480
DOI: 10.1145/3355369

Copyright © 2019 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
DNS
Homoglyph
IDN homograph
Unicode
Qualifiers
- research-article
- Research
- Refereed limited
Acceptance Rates
- IMC '19 Paper Acceptance Rate: 39 of 197 submissions, 20%
- Overall Acceptance Rate: 277 of 1,083 submissions, 26%