Who is .com? Learning to Parse WHOIS Records Suqi Liu Ian Foster Stefan Savage
[email protected] [email protected] [email protected] Geoffrey M. Voelker Lawrence K. Saul
[email protected] [email protected] Department of Computer Science and Engineering University of California, San Diego ABSTRACT 1. INTRODUCTION WHOIS is a long-established protocol for querying information about Most common Internet protocols today offer standardized syntax the 280M+ registered domain names on the Internet. Unfortunately, and schemas. Indeed, it is the ability to easily parse and normal- while such records are accessible in a “human-readable” format, ize protocol fields that directly enables a broad array of network they do not follow any consistent schema and thus are challeng- measurement research (e.g., comparing and correlating from dis- ing to analyze at scale. Existing approaches, which rely on manual parate data sources including BGP route tables, TCP flow data and crafting of parsing rules and per-registrar templates, are inherently DNS measurements). By contrast, the WHOIS protocol—the sole limited in coverage and fragile to ongoing changes in data repre- source of information mapping domain names to their rich owner- sentations. In this paper, we develop a statistical model for parsing ship and administrative context—is standard only in its transport WHOIS records that learns from labeled examples. Our model is mechanism, while the format and contents of the registration data a conditional random field (CRF) with a small number of hidden returned varies tremendously among providers. This situation sig- states, a large number of domain-specific features, and parameters nificantly hampers large-scale analyses using WHOIS data, and even that are estimated by efficient dynamic-programming procedures those researchers who do use it commonly document the complexi- for probabilistic inference.