   7  Network Working Group                                       P. Faltstrom
   8  Request for Comments: 3490                                         Cisco
   9  Category: Standards Track                                     P. Hoffman
  10                                                                IMC & VPNC
  11                                                               A. Costello
  12                                                               UC Berkeley
  13                                                                March 2003
  14  
  15  
  16           Internationalizing Domain Names in Applications (IDNA)
  17  
  18  Status of this Memo
  19  
  20     This document specifies an Internet standards track protocol for the
  21     Internet community, and requests discussion and suggestions for
  22     improvements.  Please refer to the current edition of the "Internet
  23     Official Protocol Standards" (STD 1) for the standardization state
  24     and status of this protocol.  Distribution of this memo is unlimited.
  25  
  26  Copyright Notice
  27  
  28     Copyright (C) The Internet Society (2003).  All Rights Reserved.
  29  
  30  Abstract
  31  
  32     Until now, there has been no standard method for domain names to use
  33     characters outside the ASCII repertoire.  This document defines
  34     internationalized domain names (IDNs) and a mechanism called
  35     Internationalizing Domain Names in Applications (IDNA) for handling
  36     them in a standard fashion.  IDNs use characters drawn from a large
  37     repertoire (Unicode), but IDNA allows the non-ASCII characters to be
  38     represented using only the ASCII characters already allowed in so-
  39     called host names today.  This backward-compatible representation is
  40     required in existing protocols like DNS, so that IDNs can be
  41     introduced with no changes to the existing infrastructure.  IDNA is
  42     only meant for processing domain names, not free text.
  43  
  44  Table of Contents
  45  
  46     1. Introduction..................................................  2
  47        1.1 Problem Statement.........................................  3
  48        1.2 Limitations of IDNA.......................................  3
  49        1.3 Brief overview for application developers.................  4
  50     2. Terminology...................................................  5
  51     3. Requirements and applicability................................  7
  52        3.1 Requirements..............................................  7
  53        3.2 Applicability.............................................  8
  54           3.2.1. DNS resource records................................  8
  55  
  56  
  57  
  58  Faltstrom, et al.           Standards Track                     [Page 1]
  59  
  60  RFC 3490                          IDNA                        March 2003
  61  
  62  
  63           3.2.2. Non-domain-name data types stored in domain names...  9
  64     4. Conversion operations.........................................  9
  65        4.1 ToASCII................................................... 10
  66        4.2 ToUnicode................................................. 11
  67     5. ACE prefix.................................................... 12
  68     6. Implications for typical applications using DNS............... 13
  69        6.1 Entry and display in applications......................... 14
  70        6.2 Applications and resolver libraries....................... 15
  71        6.3 DNS servers............................................... 15
  72        6.4 Avoiding exposing users to the raw ACE encoding........... 16
  73        6.5  DNSSEC authentication of IDN domain names................ 16
  74     7. Name server considerations.................................... 17
  75     8. Root server considerations.................................... 17
  76     9. References.................................................... 18
  77        9.1 Normative References...................................... 18
  78        9.2 Informative References.................................... 18
  79     10. Security Considerations...................................... 19
  80     11. IANA Considerations.......................................... 20
  81     12. Authors' Addresses........................................... 21
  82     13. Full Copyright Statement..................................... 22
  83  
  84  1. Introduction
  85  
  86     IDNA works by allowing applications to use certain ASCII name labels
  87     (beginning with a special prefix) to represent non-ASCII name labels.
  88     Lower-layer protocols need not be aware of this; therefore IDNA does
  89     not depend on changes to any infrastructure.  In particular, IDNA
  90     does not depend on any changes to DNS servers, resolvers, or protocol
  91     elements, because the ASCII name service provided by the existing DNS
  92     is entirely sufficient for IDNA.
  93  
  94     This document does not require any applications to conform to IDNA,
  95     but applications can elect to use IDNA in order to support IDN while
  96     maintaining interoperability with existing infrastructure.  If an
  97     application wants to use non-ASCII characters in domain names, IDNA
  98     is the only currently-defined option.  Adding IDNA support to an
  99     existing application entails changes to the application only, and
 100     leaves room for flexibility in the user interface.
 101  
 102     A great deal of the discussion of IDN solutions has focused on
 103     transition issues and how IDN will work in a world where not all of
 104     the components have been updated.  Proposals that were not chosen by
 105     the IDN Working Group would depend on user applications, resolvers,
 106     and DNS servers being updated in order for a user to use an
 107     internationalized domain name.  Rather than rely on widespread
 108     updating of all components, IDNA depends on updates to user
 109     applications only; no changes are needed to the DNS protocol or any
  110     DNS servers or the resolvers on users' computers.
 111  
 119  1.1 Problem Statement
 120  
 121     The IDNA specification solves the problem of extending the repertoire
 122     of characters that can be used in domain names to include the Unicode
 123     repertoire (with some restrictions).
 124  
 125     IDNA does not extend the service offered by DNS to the applications.
 126     Instead, the applications (and, by implication, the users) continue
 127     to see an exact-match lookup service.  Either there is a single
 128     exactly-matching name or there is no match.  This model has served
 129     the existing applications well, but it requires, with or without
 130     internationalized domain names, that users know the exact spelling of
 131     the domain names that the users type into applications such as web
 132     browsers and mail user agents.  The introduction of the larger
 133     repertoire of characters potentially makes the set of misspellings
 134     larger, especially given that in some cases the same appearance, for
 135     example on a business card, might visually match several Unicode code
 136     points or several sequences of code points.
 137  
 138     IDNA allows the graceful introduction of IDNs not only by avoiding
 139     upgrades to existing infrastructure (such as DNS servers and mail
 140     transport agents), but also by allowing some rudimentary use of IDNs
 141     in applications by using the ASCII representation of the non-ASCII
 142     name labels.  While such names are very user-unfriendly to read and
 143     type, and hence are not suitable for user input, they allow (for
 144     instance) replying to email and clicking on URLs even though the
 145     domain name displayed is incomprehensible to the user.  In order to
 146     allow user-friendly input and output of the IDNs, the applications
 147     need to be modified to conform to this specification.
 148  
 149     IDNA uses the Unicode character repertoire, which avoids the
 150     significant delays that would be inherent in waiting for a different
  151     and specific character set to be defined for IDN purposes by some other
 152     standards developing organization.
 153  
 154  1.2 Limitations of IDNA
 155  
 156     The IDNA protocol does not solve all linguistic issues with users
 157     inputting names in different scripts.  Many important language-based
 158     and script-based mappings are not covered in IDNA and need to be
 159     handled outside the protocol.  For example, names that are entered in
 160     a mix of traditional and simplified Chinese characters will not be
  161     mapped to a single canonical name.  As another example, Scandinavian
  162     names that are entered with U+00F6 (LATIN SMALL LETTER O WITH
  163     DIAERESIS) will not be mapped to U+00F8 (LATIN SMALL LETTER O WITH
  164     STROKE).
 165  
 175     An example of an important issue that is not considered in detail in
 176     IDNA is how to provide a high probability that a user who is entering
 177     a domain name based on visual information (such as from a business
 178     card or billboard) or aural information (such as from a telephone or
 179     radio) would correctly enter the IDN.  Similar issues exist for ASCII
 180     domain names, for example the possible visual confusion between the
 181     letter 'O' and the digit zero, but the introduction of the larger
  182     repertoire of characters creates more opportunities for
  183     similar-looking and similar-sounding names.  Note that this is a complex
 184     issue relating to languages, input methods on computers, and so on.
 185     Furthermore, the kind of matching and searching necessary for a high
 186     probability of success would not fit the role of the DNS and its
 187     exact matching function.
 188  
 189  1.3 Brief overview for application developers
 190  
 191     Applications can use IDNA to support internationalized domain names
 192     anywhere that ASCII domain names are already supported, including DNS
 193     master files and resolver interfaces.  (Applications can also define
 194     protocols and interfaces that support IDNs directly using non-ASCII
 195     representations.  IDNA does not prescribe any particular
 196     representation for new protocols, but it still defines which names
 197     are valid and how they are compared.)
 198  
 199     The IDNA protocol is contained completely within applications.  It is
 200     not a client-server or peer-to-peer protocol: everything is done
 201     inside the application itself.  When used with a DNS resolver
 202     library, IDNA is inserted as a "shim" between the application and the
 203     resolver library.  When used for writing names into a DNS zone, IDNA
 204     is used just before the name is committed to the zone.
 205  
 206     There are two operations described in section 4 of this document:
 207  
 208     -  The ToASCII operation is used before sending an IDN to something
 209        that expects ASCII names (such as a resolver) or writing an IDN
 210        into a place that expects ASCII names (such as a DNS master file).
 211  
 212     -  The ToUnicode operation is used when displaying names to users,
 213        for example names obtained from a DNS zone.
 214  
 215     It is important to note that the ToASCII operation can fail.  If it
 216     fails when processing a domain name, that domain name cannot be used
 217     as an internationalized domain name and the application has to have
 218     some method of dealing with this failure.
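
The division of labor between the two operations can be illustrated with a minimal sketch. It assumes Python and its standard-library "idna" codec, which is documented as implementing this specification; the function names name_for_resolver and name_for_display are hypothetical and exist only for this illustration. Returning None on failure is one possible way of "dealing with this failure", not a recommendation.

```python
from typing import Optional


def name_for_resolver(name: str) -> Optional[bytes]:
    """Prepare a name for something that expects ASCII (ToASCII per label)."""
    try:
        return name.encode("idna")
    except UnicodeError:
        # ToASCII failed for some label, so the name cannot be used as an IDN.
        return None


def name_for_display(ascii_name: bytes) -> str:
    """Prepare an ASCII name for display to a user (ToUnicode per label)."""
    return ascii_name.decode("idna")


print(name_for_resolver("bücher.example"))
print(name_for_display(b"xn--bcher-kva.example"))
print(name_for_resolver("a" * 64 + ".example"))
```

The last call shows the failure path: a 64-character label exceeds the length limit, so the sketch reports failure instead of returning a name.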
 219  
 220     IDNA requires that implementations process input strings with
 221     Nameprep [NAMEPREP], which is a profile of Stringprep [STRINGPREP],
 222     and then with Punycode [PUNYCODE].  Implementations of IDNA MUST
 231     fully implement Nameprep and Punycode; neither Nameprep nor Punycode
  232     is optional.
 233  
 234  2. Terminology
 235  
 236     The key words "MUST", "SHALL", "REQUIRED", "SHOULD", "RECOMMENDED",
 237     and "MAY" in this document are to be interpreted as described in BCP
 238     14, RFC 2119 [RFC2119].
 239  
 240     A code point is an integer value associated with a character in a
 241     coded character set.
 242  
 243     Unicode [UNICODE] is a coded character set containing tens of
 244     thousands of characters.  A single Unicode code point is denoted by
 245     "U+" followed by four to six hexadecimal digits, while a range of
 246     Unicode code points is denoted by two hexadecimal numbers separated
 247     by "..", with no prefixes.
 248  
 249     ASCII means US-ASCII [USASCII], a coded character set containing 128
 250     characters associated with code points in the range 0..7F.  Unicode
 251     is an extension of ASCII: it includes all the ASCII characters and
 252     associates them with the same code points.
 253  
 254     The term "LDH code points" is defined in this document to mean the
 255     code points associated with ASCII letters, digits, and the hyphen-
 256     minus; that is, U+002D, 30..39, 41..5A, and 61..7A. "LDH" is an
 257     abbreviation for "letters, digits, hyphen".
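
As an illustration of the notation and of the LDH definition above, the following Python sketch tests whether a code point is an LDH code point and formats a code point in "U+" notation. The names LDH_RANGES, is_ldh, and format_code_point are hypothetical helpers introduced for this example only.

```python
# Ranges taken from the definition above: U+002D, 30..39, 41..5A, 61..7A.
LDH_RANGES = [(0x2D, 0x2D), (0x30, 0x39), (0x41, 0x5A), (0x61, 0x7A)]


def is_ldh(code_point: int) -> bool:
    """Return True if the code point is an ASCII letter, digit, or hyphen-minus."""
    return any(low <= code_point <= high for low, high in LDH_RANGES)


def format_code_point(code_point: int) -> str:
    """Render a code point as "U+" followed by at least four hex digits."""
    return f"U+{code_point:04X}"


for ch in "a-_ü":
    print(format_code_point(ord(ch)), is_ldh(ord(ch)))
```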
 258  
 259     [STD13] talks about "domain names" and "host names", but many people
 260     use the terms interchangeably.  Further, because [STD13] was not
 261     terribly clear, many people who are sure they know the exact
 262     definitions of each of these terms disagree on the definitions.  In
 263     this document the term "domain name" is used in general.  This
 264     document explicitly cites [STD3] whenever referring to the host name
 265     syntax restrictions defined therein.
 266  
 267     A label is an individual part of a domain name.  Labels are usually
 268     shown separated by dots; for example, the domain name
 269     "www.example.com" is composed of three labels: "www", "example", and
 270     "com".  (The zero-length root label described in [STD13], which can
 271     be explicit as in "www.example.com." or implicit as in
 272     "www.example.com", is not considered a label in this specification.)
 273     IDNA extends the set of usable characters in labels that are text.
 274     For the rest of this document, the term "label" is shorthand for
 275     "text label", and "every label" means "every text label".
 276  
 287     An "internationalized label" is a label to which the ToASCII
 288     operation (see section 4) can be applied without failing (with the
 289     UseSTD3ASCIIRules flag unset).  This implies that every ASCII label
 290     that satisfies the [STD13] length restriction is an internationalized
 291     label.  Therefore the term "internationalized label" is a
 292     generalization, embracing both old ASCII labels and new non-ASCII
 293     labels.  Although most Unicode characters can appear in
 294     internationalized labels, ToASCII will fail for some input strings,
 295     and such strings are not valid internationalized labels.
 296  
 297     An "internationalized domain name" (IDN) is a domain name in which
 298     every label is an internationalized label.  This implies that every
 299     ASCII domain name is an IDN (which implies that it is possible for a
 300     name to be an IDN without it containing any non-ASCII characters).
 301     This document does not attempt to define an "internationalized host
 302     name".  Just as has been the case with ASCII names, some DNS zone
 303     administrators may impose restrictions, beyond those imposed by DNS
 304     or IDNA, on the characters or strings that may be registered as
 305     labels in their zones.  Such restrictions have no impact on the
 306     syntax or semantics of DNS protocol messages; a query for a name that
 307     matches no records will yield the same response regardless of the
 308     reason why it is not in the zone.  Clients issuing queries or
 309     interpreting responses cannot be assumed to have any knowledge of
 310     zone-specific restrictions or conventions.
 311  
 312     In IDNA, equivalence of labels is defined in terms of the ToASCII
 313     operation, which constructs an ASCII form for a given label, whether
 314     or not the label was already an ASCII label.  Labels are defined to
 315     be equivalent if and only if their ASCII forms produced by ToASCII
 316     match using a case-insensitive ASCII comparison.  ASCII labels
 317     already have a notion of equivalence: upper case and lower case are
 318     considered equivalent.  The IDNA notion of equivalence is an
 319     extension of that older notion.  Equivalent labels in IDNA are
 320     treated as alternate forms of the same label, just as "foo" and "Foo"
 321     are treated as alternate forms of the same label.
 322  
 323     To allow internationalized labels to be handled by existing
 324     applications, IDNA uses an "ACE label" (ACE stands for ASCII
 325     Compatible Encoding).  An ACE label is an internationalized label
 326     that can be rendered in ASCII and is equivalent to an
 327     internationalized label that cannot be rendered in ASCII.  Given any
 328     internationalized label that cannot be rendered in ASCII, the ToASCII
 329     operation will convert it to an equivalent ACE label (whereas an
 330     ASCII label will be left unaltered by ToASCII).  ACE labels are
 331     unsuitable for display to users.  The ToUnicode operation will
 332     convert any label to an equivalent non-ACE label.  In fact, an ACE
 333     label is formally defined to be any label that the ToUnicode
 334     operation would alter (whereas non-ACE labels are left unaltered by
 343     ToUnicode).  Every ACE label begins with the ACE prefix specified in
 344     section 5.  The ToASCII and ToUnicode operations are specified in
 345     section 4.
 346  
 347     The "ACE prefix" is defined in this document to be a string of ASCII
 348     characters that appears at the beginning of every ACE label.  It is
 349     specified in section 5.
 350  
 351     A "domain name slot" is defined in this document to be a protocol
 352     element or a function argument or a return value (and so on)
 353     explicitly designated for carrying a domain name.  Examples of domain
 354     name slots include: the QNAME field of a DNS query; the name argument
 355     of the gethostbyname() library function; the part of an email address
 356     following the at-sign (@) in the From: field of an email message
 357     header; and the host portion of the URI in the src attribute of an
 358     HTML <IMG> tag.  General text that just happens to contain a domain
 359     name is not a domain name slot; for example, a domain name appearing
 360     in the plain text body of an email message is not occupying a domain
 361     name slot.
 362  
 363     An "IDN-aware domain name slot" is defined in this document to be a
 364     domain name slot explicitly designated for carrying an
 365     internationalized domain name as defined in this document.  The
 366     designation may be static (for example, in the specification of the
 367     protocol or interface) or dynamic (for example, as a result of
 368     negotiation in an interactive session).
 369  
 370     An "IDN-unaware domain name slot" is defined in this document to be
 371     any domain name slot that is not an IDN-aware domain name slot.
 372     Obviously, this includes any domain name slot whose specification
 373     predates IDNA.
 374  
 375  3. Requirements and applicability
 376  
 377  3.1 Requirements
 378  
 379     IDNA conformance means adherence to the following four requirements:
 380  
 381     1) Whenever dots are used as label separators, the following
 382        characters MUST be recognized as dots: U+002E (full stop), U+3002
 383        (ideographic full stop), U+FF0E (fullwidth full stop), U+FF61
 384        (halfwidth ideographic full stop).
 385  
 386     2) Whenever a domain name is put into an IDN-unaware domain name slot
 387        (see section 2), it MUST contain only ASCII characters.  Given an
 388        internationalized domain name (IDN), an equivalent domain name
 389        satisfying this requirement can be obtained by applying the
 399        ToASCII operation (see section 4) to each label and, if dots are
 400        used as label separators, changing all the label separators to
 401        U+002E.
 402  
 403     3) ACE labels obtained from domain name slots SHOULD be hidden from
 404        users when it is known that the environment can handle the non-ACE
 405        form, except when the ACE form is explicitly requested.  When it
 406        is not known whether or not the environment can handle the non-ACE
 407        form, the application MAY use the non-ACE form (which might fail,
 408        such as by not being displayed properly), or it MAY use the ACE
  409        form (which will look unintelligible to the user).  Given an
 410        internationalized domain name, an equivalent domain name
 411        containing no ACE labels can be obtained by applying the ToUnicode
 412        operation (see section 4) to each label.  When requirements 2 and
 413        3 both apply, requirement 2 takes precedence.
 414  
 415     4) Whenever two labels are compared, they MUST be considered to match
 416        if and only if they are equivalent, that is, their ASCII forms
 417        (obtained by applying ToASCII) match using a case-insensitive
 418        ASCII comparison.  Whenever two names are compared, they MUST be
 419        considered to match if and only if their corresponding labels
 420        match, regardless of whether the names use the same forms of label
 421        separators.
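
Requirements 1 and 4 can be sketched together in Python. The sketch assumes the ToASCII function from the standard-library module encodings.idna; the names DOTS, split_labels, labels_match, and names_match are hypothetical. It treats a label for which ToASCII fails as matching nothing, which is an assumption of this sketch, and it does not handle a trailing root dot.

```python
import re
from encodings.idna import ToASCII  # returns the ASCII form as bytes

# Requirement 1: all four characters are recognized as label separators.
DOTS = re.compile("[\u002E\u3002\uFF0E\uFF61]")


def split_labels(name: str) -> list:
    return DOTS.split(name)


def labels_match(a: str, b: str) -> bool:
    """Requirement 4: compare ASCII forms case-insensitively."""
    try:
        return ToASCII(a).lower() == ToASCII(b).lower()
    except UnicodeError:
        # Assumption of this sketch: a failing label matches nothing.
        return False


def names_match(x: str, y: str) -> bool:
    """Names match if and only if their corresponding labels match."""
    labels_x, labels_y = split_labels(x), split_labels(y)
    return len(labels_x) == len(labels_y) and all(
        labels_match(a, b) for a, b in zip(labels_x, labels_y)
    )


print(names_match("www.bücher.example", "WWW\u3002xn--bcher-kva\u3002EXAMPLE"))
```

The two names in the usage line differ in separator, in case, and in ACE versus non-ACE form, yet they compare as equal under these rules.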
 422  
 423  3.2 Applicability
 424  
 425     IDNA is applicable to all domain names in all domain name slots
 426     except where it is explicitly excluded.
 427  
 428     This implies that IDNA is applicable to many protocols that predate
 429     IDNA.  Note that IDNs occupying domain name slots in those protocols
 430     MUST be in ASCII form (see section 3.1, requirement 2).
 431  
 432  3.2.1. DNS resource records
 433  
 434     IDNA does not apply to domain names in the NAME and RDATA fields of
 435     DNS resource records whose CLASS is not IN.  This exclusion applies
 436     to every non-IN class, present and future, except where future
 437     standards override this exclusion by explicitly inviting the use of
 438     IDNA.
 439  
 440     There are currently no other exclusions on the applicability of IDNA
 441     to DNS resource records; it depends entirely on the CLASS, and not on
 442     the TYPE.  This will remain true, even as new types are defined,
 443     unless there is a compelling reason for a new type to complicate
 444     matters by imposing type-specific rules.
 445  
 455  3.2.2. Non-domain-name data types stored in domain names
 456  
 457     Although IDNA enables the representation of non-ASCII characters in
 458     domain names, that does not imply that IDNA enables the
 459     representation of non-ASCII characters in other data types that are
 460     stored in domain names.  For example, an email address local part is
 461     sometimes stored in a domain label (hostmaster@example.com would be
 462     represented as hostmaster.example.com in the RDATA field of an SOA
 463     record).  IDNA does not update the existing email standards, which
 464     allow only ASCII characters in local parts.  Therefore, unless the
 465     email standards are revised to invite the use of IDNA for local
 466     parts, a domain label that holds the local part of an email address
 467     SHOULD NOT begin with the ACE prefix, and even if it does, it is to
 468     be interpreted literally as a local part that happens to begin with
 469     the ACE prefix.
 470  
 471  4. Conversion operations
 472  
  473     An application converts a domain name whenever the name is put into
  474     an IDN-unaware slot or displayed to a user.  This section specifies
  475     the conversion steps and the ToASCII and ToUnicode operations.
 476  
 477     The input to ToASCII or ToUnicode is a single label that is a
 478     sequence of Unicode code points (remember that all ASCII code points
 479     are also Unicode code points).  If a domain name is represented using
 480     a character set other than Unicode or US-ASCII, it will first need to
 481     be transcoded to Unicode.
 482  
 483     Starting from a whole domain name, the steps that an application
 484     takes to do the conversions are:
 485  
 486     1) Decide whether the domain name is a "stored string" or a "query
 487        string" as described in [STRINGPREP].  If this conversion follows
 488        the "queries" rule from [STRINGPREP], set the flag called
 489        "AllowUnassigned".
 490  
 491     2) Split the domain name into individual labels as described in
 492        section 3.1.  The labels do not include the separator.
 493  
 494     3) For each label, decide whether or not to enforce the restrictions
 495        on ASCII characters in host names [STD3].  (Applications already
 496        faced this choice before the introduction of IDNA, and can
 497        continue to make the decision the same way they always have; IDNA
 498        makes no new recommendations regarding this choice.)  If the
 499        restrictions are to be enforced, set the flag called
 500        "UseSTD3ASCIIRules" for that label.
 501  
 511     4) Process each label with either the ToASCII or the ToUnicode
 512        operation as appropriate.  Typically, you use the ToASCII
 513        operation if you are about to put the name into an IDN-unaware
 514        slot, and you use the ToUnicode operation if you are displaying
 515        the name to a user; section 3.1 gives greater detail on the
 516        applicable requirements.
 517  
 518     5) If ToASCII was applied in step 4 and dots are used as label
 519        separators, change all the label separators to U+002E (full stop).
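
Steps 2, 4, and 5 above can be sketched in Python as follows. The sketch assumes the ToASCII and ToUnicode functions from the standard-library module encodings.idna, which take no flags, so the AllowUnassigned and UseSTD3ASCIIRules decisions of steps 1 and 3 are not modeled here. The names SEPARATORS, safe_to_unicode, and convert_name are hypothetical. A ToASCII failure propagates as UnicodeError, and a trailing root dot is not handled.

```python
import re
from encodings.idna import ToASCII, ToUnicode

# The capturing group makes re.split keep the separators in its result.
SEPARATORS = re.compile("([\u002E\u3002\uFF0E\uFF61])")


def safe_to_unicode(label: str) -> str:
    """Fall back to the input so that display conversion never fails."""
    try:
        return ToUnicode(label)
    except UnicodeError:
        return label


def convert_name(name: str, for_display: bool) -> str:
    # Step 2: split into labels (even indexes) and separators (odd indexes).
    parts = SEPARATORS.split(name)
    labels, separators = parts[0::2], parts[1::2]
    if for_display:
        # Step 4, display case: ToUnicode on each label.
        converted = [safe_to_unicode(label) for label in labels]
    else:
        # Step 4, IDN-unaware slot: ToASCII on each label.
        converted = [ToASCII(label).decode("ascii") for label in labels]
        # Step 5: all separators become U+002E.
        separators = ["."] * len(separators)
    pieces = [converted[0]]
    for separator, label in zip(separators, converted[1:]):
        pieces.extend((separator, label))
    return "".join(pieces)


print(convert_name("www\u3002bücher\u3002example", for_display=False))
print(convert_name("www.xn--bcher-kva.example", for_display=True))
```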
 520  
 521     The following two subsections define the ToASCII and ToUnicode
 522     operations that are used in step 4.
 523  
 524     This description of the protocol uses specific procedure names, names
 525     of flags, and so on, in order to facilitate the specification of the
 526     protocol.  These names, as well as the actual steps of the
 527     procedures, are not required of an implementation.  In fact, any
 528     implementation which has the same external behavior as specified in
 529     this document conforms to this specification.
 530  
 531  4.1 ToASCII
 532  
 533     The ToASCII operation takes a sequence of Unicode code points that
 534     make up one label and transforms it into a sequence of code points in
 535     the ASCII range (0..7F).  If ToASCII succeeds, the original sequence
 536     and the resulting sequence are equivalent labels.
 537  
 538     It is important to note that the ToASCII operation can fail.  ToASCII
 539     fails if any step of it fails.  If any step of the ToASCII operation
 540     fails on any label in a domain name, that domain name MUST NOT be
 541     used as an internationalized domain name.  The method for dealing
 542     with this failure is application-specific.
 543  
 544     The inputs to ToASCII are a sequence of code points, the
 545     AllowUnassigned flag, and the UseSTD3ASCIIRules flag.  The output of
 546     ToASCII is either a sequence of ASCII code points or a failure
 547     condition.
 548  
 549     ToASCII never alters a sequence of code points that are all in the
 550     ASCII range to begin with (although it could fail).  Applying the
 551     ToASCII operation multiple times has exactly the same effect as
 552     applying it just once.
 553  
 554     ToASCII consists of the following steps:
 555  
 556     1. If the sequence contains any code points outside the ASCII range
 557        (0..7F) then proceed to step 2, otherwise skip to step 3.
 558  
 567     2. Perform the steps specified in [NAMEPREP] and fail if there is an
 568        error.  The AllowUnassigned flag is used in [NAMEPREP].
 569  
 570     3. If the UseSTD3ASCIIRules flag is set, then perform these checks:
 571  
 572       (a) Verify the absence of non-LDH ASCII code points; that is, the
 573           absence of 0..2C, 2E..2F, 3A..40, 5B..60, and 7B..7F.
 574  
 575       (b) Verify the absence of leading and trailing hyphen-minus; that
 576           is, the absence of U+002D at the beginning and end of the
 577           sequence.
 578  
 579     4. If the sequence contains any code points outside the ASCII range
 580        (0..7F) then proceed to step 5, otherwise skip to step 8.
 581  
 582     5. Verify that the sequence does NOT begin with the ACE prefix.
 583  
 584     6. Encode the sequence using the encoding algorithm in [PUNYCODE] and
 585        fail if there is an error.
 586  
 587     7. Prepend the ACE prefix.
 588  
 589     8. Verify that the number of code points is in the range 1 to 63
 590        inclusive.
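
The eight steps above can be sketched in Python. This is an illustration under stated assumptions, not a reference implementation: it relies on the nameprep function from the standard-library module encodings.idna and on the standard "punycode" codec, it does not model the AllowUnassigned flag, and the names to_ascii, ToASCIIError, LDH, and is_ascii are hypothetical.

```python
import string
from encodings.idna import nameprep  # standard-library Nameprep

ACE_PREFIX = "xn--"
LDH = set(string.ascii_letters + string.digits + "-")


class ToASCIIError(ValueError):
    """Hypothetical error type signalling that ToASCII failed."""


def is_ascii(label: str) -> bool:
    return all(ord(c) <= 0x7F for c in label)


def to_ascii(label: str, use_std3_ascii_rules: bool = False) -> str:
    # Steps 1-2: apply Nameprep only if a non-ASCII code point is present.
    if not is_ascii(label):
        try:
            label = nameprep(label)
        except UnicodeError as exc:
            raise ToASCIIError("Nameprep failed") from exc
    # Step 3: optional host name checks.
    if use_std3_ascii_rules:
        if any(ord(c) <= 0x7F and c not in LDH for c in label):
            raise ToASCIIError("non-LDH ASCII code point")
        if label.startswith("-") or label.endswith("-"):
            raise ToASCIIError("leading or trailing hyphen-minus")
    # Steps 4-7: encode only if non-ASCII code points remain.
    if not is_ascii(label):
        if label[:4].lower() == ACE_PREFIX:
            raise ToASCIIError("label already begins with the ACE prefix")
        try:
            encoded = label.encode("punycode").decode("ascii")
        except UnicodeError as exc:
            raise ToASCIIError("Punycode encoding failed") from exc
        label = ACE_PREFIX + encoded
    # Step 8: length check.
    if not 1 <= len(label) <= 63:
        raise ToASCIIError("label length outside 1..63")
    return label


print(to_ascii("bücher"))
print(to_ascii(to_ascii("bücher")))  # applying ToASCII twice changes nothing
```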
 591  
 592  4.2 ToUnicode
 593  
 594     The ToUnicode operation takes a sequence of Unicode code points that
 595     make up one label and returns a sequence of Unicode code points.  If
 596     the input sequence is a label in ACE form, then the result is an
 597     equivalent internationalized label that is not in ACE form, otherwise
 598     the original sequence is returned unaltered.
 599  
 600     ToUnicode never fails.  If any step fails, then the original input
 601     sequence is returned immediately in that step.
 602  
 603     The ToUnicode output never contains more code points than its input.
 604     Note that the number of octets needed to represent a sequence of code
 605     points depends on the particular character encoding used.
 606  
 607     The inputs to ToUnicode are a sequence of code points, the
 608     AllowUnassigned flag, and the UseSTD3ASCIIRules flag.  The output of
 609     ToUnicode is always a sequence of Unicode code points.
 610  
 611     1. If all code points in the sequence are in the ASCII range (0..7F)
 612        then skip to step 3.
 613  
 623     2. Perform the steps specified in [NAMEPREP] and fail if there is an
 624        error.  (If step 3 of ToASCII is also performed here, it will not
 625        affect the overall behavior of ToUnicode, but it is not
 626        necessary.)  The AllowUnassigned flag is used in [NAMEPREP].
 627  
 628     3. Verify that the sequence begins with the ACE prefix, and save a
 629        copy of the sequence.
 630  
 631     4. Remove the ACE prefix.
 632  
 633     5. Decode the sequence using the decoding algorithm in [PUNYCODE] and
 634        fail if there is an error.  Save a copy of the result of this
 635        step.
 636  
 637     6. Apply ToASCII.
 638  
 639     7. Verify that the result of step 6 matches the saved copy from step
 640        3, using a case-insensitive ASCII comparison.
 641  
 642     8. Return the saved copy from step 5.
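
The ToUnicode steps can be sketched in the same way.  This is an
illustrative sketch, not a reference implementation.  The name
to_unicode is hypothetical, and step 6 is delegated to the standard
library function encodings.idna.ToASCII, which is documented as
assuming UseSTD3ASCIIRules is false.  Every failure path returns the
original input, reflecting the rule that ToUnicode never fails.

```python
from encodings.idna import ToASCII, nameprep

ACE_PREFIX = "xn--"


def to_unicode(label: str) -> str:
    """Illustrative ToUnicode for one label; never raises on bad labels."""
    original = label
    try:
        # Steps 1-2: Nameprep only when non-ASCII code points are present.
        if any(ord(c) > 0x7F for c in label):
            label = nameprep(label)

        # Step 3: the prefix is recognized case-insensitively.
        if not label.lower().startswith(ACE_PREFIX):
            return original
        saved_ace = label

        # Steps 4-5: strip the prefix and decode with Punycode.
        tail = label[len(ACE_PREFIX):]
        decoded = tail.encode("ascii").decode("punycode")

        # Steps 6-7: re-encode and compare, ignoring ASCII case.
        round_trip = ToASCII(decoded).decode("ascii")
        if round_trip.lower() != saved_ace.lower():
            return original

        # Step 8: return the decoded form saved in step 5.
        return decoded
    except ValueError:
        # UnicodeError is a subclass of ValueError; any failing step
        # returns the original input sequence.
        return original
```

With this sketch, to_unicode("xn--bcher-kva") yields "b\u00fccher",
while "example" and the bare prefix "xn--" come back unaltered.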
 643  
 644  5. ACE prefix
 645  
 646     The ACE prefix, used in the conversion operations (section 4), is two
 647     alphanumeric ASCII characters followed by two hyphen-minuses.  It
 648     cannot be any of the prefixes already used in earlier documents,
 649     which includes the following: "bl--", "bq--", "dq--", "lq--", "mq--",
 650     "ra--", "wq--" and "zq--".  The ToASCII and ToUnicode operations MUST
 651     recognize the ACE prefix in a case-insensitive manner.
 652  
 653     The ACE prefix for IDNA is "xn--" or any capitalization thereof.
 654  
 655     This means that an ACE label might be "xn--de-jg4avhby1noc0d", where
 656     "de-jg4avhby1noc0d" is the part of the ACE label that is generated by
 657     the encoding steps in [PUNYCODE].
 658  
 659     While all ACE labels begin with the ACE prefix, not all labels
 660     beginning with the ACE prefix are necessarily ACE labels.  Non-ACE
 661     labels that begin with the ACE prefix will confuse users and SHOULD
 662     NOT be allowed in DNS zones.
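
Case-insensitive recognition of the ACE prefix might look like the
following sketch.  The helper names has_ace_prefix and punycode_part
are hypothetical.  Note that a true result only says the prefix is
present; it does not establish that the label is a valid ACE label.

```python
ACE_PREFIX = "xn--"


def has_ace_prefix(label: str) -> bool:
    """True if the label begins with the ACE prefix in any capitalization."""
    return label[:len(ACE_PREFIX)].lower() == ACE_PREFIX


def punycode_part(label: str) -> str:
    """Return what follows the ACE prefix; ValueError if it is absent."""
    if not has_ace_prefix(label):
        raise ValueError("label does not begin with the ACE prefix")
    return label[len(ACE_PREFIX):]
```

Applied to the example above, punycode_part("xn--de-jg4avhby1noc0d")
returns "de-jg4avhby1noc0d".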
 663  
 679  6. Implications for typical applications using DNS
 680  
 681     In IDNA, applications perform the processing needed to input
 682     internationalized domain names from users, display internationalized
 683     domain names to users, and process the inputs and outputs from DNS
 684     and other protocols that carry domain names.
 685  
 686     The components and interfaces between them can be represented
 687     pictorially as:
 688  
 689                      +------+
 690                      | User |
 691                      +------+
 692                         ^
 693                         | Input and display: local interface methods
 694                         | (pen, keyboard, glowing phosphorus, ...)
 695     +-------------------|-------------------------------+
 696     |                   v                               |
 697     |          +-----------------------------+          |
 698     |          |        Application          |          |
 699     |          |   (ToASCII and ToUnicode    |          |
 700     |          |      operations may be      |          |
 701     |          |        called here)         |          |
 702     |          +-----------------------------+          |
 703     |                   ^        ^                      | End system
 704     |                   |        |                      |
 705     | Call to resolver: |        | Application-specific |
 706     |              ACE  |        | protocol:            |
 707     |                   v        | ACE unless the       |
 708     |           +----------+     | protocol is updated  |
 709     |           | Resolver |     | to handle other      |
 710     |           +----------+     | encodings            |
 711     |                 ^          |                      |
 712     +-----------------|----------|----------------------+
 713         DNS protocol: |          |
 714                   ACE |          |
 715                       v          v
 716            +-------------+    +---------------------+
 717            | DNS servers |    | Application servers |
 718            +-------------+    +---------------------+
 719  
 720     The box labeled "Application" is where the application splits a
 721     domain name into labels, sets the appropriate flags, and performs the
 722     ToASCII and ToUnicode operations.  This is described in section 4.
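
The work done in the "Application" box can be sketched as follows.
This is an illustration only.  The function name domain_to_ascii is
hypothetical, the four separator code points are the ones section 3.1
tells applications to recognize as dots, and the per-label conversion
uses the standard library function encodings.idna.ToASCII, which
assumes UseSTD3ASCIIRules is false and raises UnicodeError on
failure.

```python
import re
from encodings.idna import ToASCII

# U+002E, U+3002, U+FF0E and U+FF61 are all treated as label separators.
DOTS = re.compile("[.\u3002\uff0e\uff61]")


def domain_to_ascii(name: str) -> str:
    """Split a domain name into labels and apply ToASCII to each one."""
    labels = DOTS.split(name)
    # A trailing empty label denotes the root and is not converted.
    trailing_root = len(labels) > 1 and labels[-1] == ""
    if trailing_root:
        labels = labels[:-1]
    converted = [ToASCII(label).decode("ascii") for label in labels]
    return ".".join(converted) + ("." if trailing_root else "")
```

Here domain_to_ascii("b\u00fccher.example") gives
"xn--bcher-kva.example", and the same result is obtained when the
separator is the ideographic full stop U+3002.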
 723  
 735  6.1 Entry and display in applications
 736  
 737     Applications can accept domain names using any character set or sets
 738     desired by the application developer, and can display domain names in
 739     any charset.  That is, the IDNA protocol does not affect the
 740     interface between users and applications.
 741  
 742     An IDNA-aware application can accept and display internationalized
 743     domain names in two formats: the internationalized character set(s)
 744     supported by the application, and as an ACE label.  ACE labels that
 745     are displayed or input MUST always include the ACE prefix.
 746     Applications MAY allow input and display of ACE labels, but are not
 747     encouraged to do so except as an interface for special purposes,
 748     possibly for debugging, or to cope with display limitations as
 749     described in section 6.4.  ACE encoding is opaque and ugly, and
 750     should thus only be exposed to users who absolutely need it.  Because
 751     name labels encoded as ACE name labels can be rendered either as the
 752     encoded ASCII characters or the proper decoded characters, the
 753     application MAY have an option for the user to select the preferred
 754     method of display; if it does, rendering the ACE SHOULD NOT be the
 755     default.
 756  
 757     Domain names are often stored and transported in many places.  For
 758     example, they are part of documents such as mail messages and web
 759     pages.  They are transported in many parts of many protocols, such as
 760     both the control commands and the RFC 2822 body parts of SMTP, and
 761     the headers and the body content in HTTP.  It is important to
 762     remember that domain names appear both in domain name slots and in
 763     the content that is passed over protocols.
 764  
 765     In protocols and document formats that define how to handle
 766     specification or negotiation of charsets, labels can be encoded in
 767     any charset allowed by the protocol or document format.  If a
 768     protocol or document format only allows one charset, the labels MUST
 769     be given in that charset.
 770  
 771     In any place where a protocol or document format allows transmission
 772     of the characters in internationalized labels, internationalized
 773     labels SHOULD be transmitted using whatever character encoding and
 774     escape mechanism that the protocol or document format uses at that
 775     place.
 776  
 777     All protocols that use domain name slots already have the capacity
 778     for handling domain names in the ASCII charset.  Thus, ACE labels
 779     (internationalized labels that have been processed with the ToASCII
 780     operation) can inherently be handled by those protocols.
 781  
 791  6.2 Applications and resolver libraries
 792  
 793     Applications normally use functions in the operating system when they
 794     resolve DNS queries.  Those functions in the operating system are
 795     often called "the resolver library", and the applications communicate
 796     with the resolver libraries through a programming interface (API).
 797  
 798     Because these resolver libraries today expect only domain names in
 799     ASCII, applications MUST prepare labels that are passed to the
 800     resolver library using the ToASCII operation.  Labels received from
 801     the resolver library contain only ASCII characters; internationalized
 802     labels that cannot be represented directly in ASCII use the ACE form.
 803     ACE labels always include the ACE prefix.
 804  
 805     An operating system might have a set of libraries for performing the
 806     ToASCII operation.  The input to such a library might be in one or
 807     more charsets that are used in applications (UTF-8 and UTF-16 are
 808     likely candidates for almost any operating system, and script-
 809     specific charsets are likely for localized operating systems).
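
As one example of such a library, Python ships an "idna" codec that
implements the operations of this document.  The sketch below assumes
the application holds a name as octets in some local charset; the
function name prepare_for_resolver is hypothetical.  The name is
first transcoded to Unicode and then converted label by label to its
ASCII form.

```python
def prepare_for_resolver(raw: bytes, charset: str) -> bytes:
    """Transcode a domain name to Unicode, then produce its ASCII form."""
    name = raw.decode(charset)      # local charset -> Unicode
    return name.encode("idna")      # per-label ToASCII, joined with "."
```

The result depends only on the Unicode characters, not on the charset
in which they arrived: the UTF-8, UTF-16 and ISO 8859-1 encodings of
"b\u00fccher.example" all yield b"xn--bcher-kva.example".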
 810  
 811     IDNA-aware applications MUST be able to work with both non-
 812     internationalized labels (those that conform to [STD13] and [STD3])
 813     and internationalized labels.
 814  
 815     It is expected that new versions of the resolver libraries in the
 816     future will be able to accept domain names in other charsets than
 817     ASCII, and application developers might one day pass not only domain
 818     names in Unicode, but also in local script to a new API for the
 819     resolver libraries in the operating system.  Thus the ToASCII and
 820     ToUnicode operations might be performed inside these new versions of
 821     the resolver libraries.
 822  
 823     Domain names passed to resolvers or put into the question section of
 824     DNS requests follow the rules for "queries" from [STRINGPREP].
 825  
 826  6.3 DNS servers
 827  
 828     Domain names stored in zones follow the rules for "stored strings"
 829     from [STRINGPREP].
 830  
 831     For internationalized labels that cannot be represented directly in
 832     ASCII, DNS servers MUST use the ACE form produced by the ToASCII
 833     operation.  All IDNs served by DNS servers MUST contain only ASCII
 834     characters.
 835  
 836     If a signaling system which makes negotiation possible between old
 837     and new DNS clients and servers is standardized in the future, the
 838     encoding of the query in the DNS protocol itself can be changed from
 847     ACE to something else, such as UTF-8.  The question whether or not
 848     this should be used is, however, a separate problem and is not
 849     discussed in this memo.
 850  
 851  6.4 Avoiding exposing users to the raw ACE encoding
 852  
 853     Any application that might show the user a domain name obtained from
 854     a domain name slot, such as from gethostbyaddr or part of a mail
 855     header, will need to be updated if it is to prevent users from seeing
 856     the ACE.
 857  
 858     If an application decodes an ACE name using ToUnicode but cannot show
 859     all of the characters in the decoded name, such as if the name
 860     contains characters that the output system cannot display, the
 861     application SHOULD show the name in ACE format (which always includes
 862     the ACE prefix) instead of displaying the name with the replacement
 863     character (U+FFFD).  This is to make it easier for the user to
 864     transfer the name correctly to other programs.  Programs that by
 865     default show the ACE form when they cannot show all the characters in
 866     a name label SHOULD also have a mechanism to show the name that is
 867     produced by the ToUnicode operation with as many characters as
 868     possible and replacement characters in the positions where characters
 869     cannot be displayed.
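
The display rule above might be sketched as follows.  This is an
illustration under a simplifying assumption: "the output system can
display a character" is modeled as "the character can be encoded in a
given output charset", which real applications would replace with a
font or terminal capability check.  All function names are
hypothetical, and decoding uses Python's built-in "idna" codec.

```python
def decode_label(ace_label: str) -> str:
    """Decode one label; return it unaltered if decoding fails."""
    try:
        return ace_label.encode("ascii").decode("idna")
    except UnicodeError:
        return ace_label


def can_display(ch: str, charset: str) -> bool:
    """Stand-in for a real display capability check (an assumption)."""
    try:
        ch.encode(charset)
        return True
    except UnicodeEncodeError:
        return False


def display_form(ace_label: str, charset: str) -> str:
    """Show the decoded label, or the ACE form if any character is lost."""
    decoded = decode_label(ace_label)
    if all(can_display(c, charset) for c in decoded):
        return decoded
    return ace_label


def lossy_form(ace_label: str, charset: str) -> str:
    """Optional view: decoded label with U+FFFD for undisplayable characters."""
    decoded = decode_label(ace_label)
    return "".join(c if can_display(c, charset) else "\ufffd" for c in decoded)
```

With an ISO 8859-1 display, "xn--bcher-kva" is shown decoded; with an
ASCII-only display, display_form keeps the ACE form, and lossy_form
offers the partially readable alternative.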
 870  
 871     The ToUnicode operation does not alter labels that are not valid ACE
 872     labels, even if they begin with the ACE prefix.  After ToUnicode has
 873     been applied, if a label still begins with the ACE prefix, then it is
 874     not a valid ACE label, and is not equivalent to any of the
 875     intermediate Unicode strings constructed by ToUnicode.
 876  
 877  6.5 DNSSEC authentication of IDN domain names
 878  
 879     DNS Security [RFC2535] is a method for supplying cryptographic
 880     verification information along with DNS messages.  Public Key
 881     Cryptography is used in conjunction with digital signatures to
 882     provide a means for a requester of domain information to authenticate
 883     the source of the data.  This ensures that it can be traced back to a
 884     trusted source, either directly, or via a chain of trust linking the
 885     source of the information to the top of the DNS hierarchy.
 886  
 887     IDNA specifies that all internationalized domain names served by DNS
 888     servers that cannot be represented directly in ASCII must use the ACE
 889     form produced by the ToASCII operation.  This operation must be
 890     performed prior to a zone being signed by the private key for that
 891     zone.  Because of this ordering, it is important to recognize that
 892     DNSSEC authenticates the ASCII domain name, not the Unicode form or
 903     the mapping between the Unicode form and the ASCII form.  In the
 904     presence of DNSSEC, this is the name that MUST be signed in the zone
 905     and MUST be validated against.
 906  
 907     One consequence of this for sites deploying IDNA in the presence of
 908     DNSSEC is that any special purpose proxies or forwarders used to
 909     transform user input into IDNs must be earlier in the resolution flow
 910     than DNSSEC authenticating nameservers for DNSSEC to work.
 911  
 912  7. Name server considerations
 913  
 914     Existing DNS servers do not know the IDNA rules for handling non-
 915     ASCII forms of IDNs, and therefore need to be shielded from them.
 916     All existing channels through which names can enter a DNS server
 917     database (for example, master files [STD13] and DNS update messages
 918     [RFC2136]) are IDN-unaware because they predate IDNA, and therefore
 919     requirement 2 of section 3.1 of this document provides the needed
 920     shielding, by ensuring that internationalized domain names entering
 921     DNS server databases through such channels have already been
 922     converted to their equivalent ASCII forms.
 923  
 924     It is imperative that there be only one ASCII encoding for a
 925     particular domain name.  Because of the design of the ToASCII and
 926     ToUnicode operations, there are no ACE labels that decode to ASCII
 927     labels, and therefore name servers cannot contain multiple ASCII
 928     encodings of the same domain name.
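
The reason no ACE label decodes to an ASCII label can be seen by
walking through ToUnicode steps 5 to 7 for a hand-made candidate.
The sketch below is illustrative; it uses Python's "punycode" codec
and the standard library function encodings.idna.ToASCII, and the
variable names are chosen here.

```python
from encodings.idna import ToASCII

ascii_label = "abc"

# Punycode itself can encode an all-ASCII string; it just appends the
# delimiter.  Prepending the ACE prefix gives a would-be ACE label.
candidate = "xn--" + ascii_label.encode("punycode").decode("ascii")

decoded = candidate[4:].encode("ascii").decode("punycode")   # step 5
round_trip = ToASCII(decoded).decode("ascii")                # step 6
is_ace_label = round_trip.lower() == candidate.lower()       # step 7
```

ToASCII leaves the all-ASCII string "abc" unchanged, so the result of
step 6 cannot match the candidate "xn--abc-".  Step 7 fails,
ToUnicode returns the candidate unaltered, and "abc" keeps exactly
one ASCII form.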
 929  
 930     [RFC2181] explicitly allows domain labels to contain octets beyond
 931     the ASCII range (0..7F), and this document does not change that.
 932     Note, however, that there is no defined interpretation of octets
 933     80..FF as characters.  If labels containing these octets are returned
 934     to applications, unpredictable behavior could result.  The ASCII form
 935     defined by ToASCII is the only standard representation for
 936     internationalized labels in the current DNS protocol.
 937  
 938  8. Root server considerations
 939  
 940     IDNs are likely to be somewhat longer than current domain names, so
 941     the bandwidth needed by the root servers is likely to go up by a
 942     small amount.  Also, queries and responses for IDNs will probably be
 943     somewhat longer than typical queries today, so more queries and
 944     responses may be forced to go to TCP instead of UDP.
 945  
 959  9. References
 960  
 961  9.1 Normative References
 962  
 963     [RFC2119]    Bradner, S., "Key words for use in RFCs to Indicate
 964                  Requirement Levels", BCP 14, RFC 2119, March 1997.
 965  
 966     [STRINGPREP] Hoffman, P. and M. Blanchet, "Preparation of
 967                  Internationalized Strings ("stringprep")", RFC 3454,
 968                  December 2002.
 969  
 970     [NAMEPREP]   Hoffman, P. and M. Blanchet, "Nameprep: A Stringprep
 971                  Profile for Internationalized Domain Names (IDN)", RFC
 972                  3491, March 2003.
 973  
 974     [PUNYCODE]   Costello, A., "Punycode: A Bootstring encoding of
 975                  Unicode for use with Internationalized Domain Names in
 976                  Applications (IDNA)", RFC 3492, March 2003.
 977  
 978     [STD3]       Braden, R., "Requirements for Internet Hosts --
 979                  Communication Layers", STD 3, RFC 1122, and
 980                  "Requirements for Internet Hosts -- Application and
 981                  Support", STD 3, RFC 1123, October 1989.
 982  
 983     [STD13]      Mockapetris, P., "Domain names - concepts and
 984                  facilities", STD 13, RFC 1034 and "Domain names -
 985                  implementation and specification", STD 13, RFC 1035,
 986                  November 1987.
 987  
 988  9.2 Informative References
 989  
 990     [RFC2535]    Eastlake, D., "Domain Name System Security Extensions",
 991                  RFC 2535, March 1999.
 992  
 993     [RFC2181]    Elz, R. and R. Bush, "Clarifications to the DNS
 994                  Specification", RFC 2181, July 1997.
 995  
 996     [UAX9]       Unicode Standard Annex #9, The Bidirectional Algorithm,
 997                  <http://www.unicode.org/unicode/reports/tr9/>.
 998  
 999     [UNICODE]    The Unicode Consortium. The Unicode Standard, Version
1000                  3.2.0 is defined by The Unicode Standard, Version 3.0
1001                  (Reading, MA, Addison-Wesley, 2000. ISBN 0-201-61633-5),
1002                  as amended by the Unicode Standard Annex #27: Unicode
1003                  3.1 (http://www.unicode.org/reports/tr27/) and by the
1004                  Unicode Standard Annex #28: Unicode 3.2
1005                  (http://www.unicode.org/reports/tr28/).
1006  
1015     [USASCII]    Cerf, V., "ASCII format for Network Interchange", RFC
1016                  20, October 1969.
1017  
1018  10. Security Considerations
1019  
1020     Security on the Internet partly relies on the DNS.  Thus, any change
1021     to the characteristics of the DNS can change the security of much of
1022     the Internet.
1023  
1024     This memo describes an algorithm which encodes characters that are
1025     not valid according to STD3 and STD13 into octet values that are
1026     valid.  No security issues such as string length increases or new
1027     allowed values are introduced by the encoding process or the use of
1028     these encoded values, apart from those introduced by the ACE encoding
1029     itself.
1030  
1031     Domain names are used by users to identify and connect to Internet
1032     servers.  The security of the Internet is compromised if a user
1033     entering a single internationalized name is connected to different
1034     servers based on different interpretations of the internationalized
1035     domain name.
1036  
1037     When systems use local character sets other than ASCII and Unicode,
1038     this specification leaves the problem of transcoding between the
1039     local character set and Unicode up to the application.  If different
1040     applications (or different versions of one application) implement
1041     different transcoding rules, they could interpret the same name
1042     differently and contact different servers.  This problem is not
1043     solved by security protocols like TLS that do not take local
1044     character sets into account.
1045  
1046     Because this document normatively refers to [NAMEPREP], [PUNYCODE],
1047     and [STRINGPREP], it includes the security considerations from those
1048     documents as well.
1049  
1050     If or when this specification is updated to use a more recent Unicode
1051     normalization table, the new normalization table will need to be
1052     compared with the old to spot backwards incompatible changes.  If
1053     there are such changes, they will need to be handled somehow, or
1054     there will be security as well as operational implications.  Methods
1055     to handle the conflicts could include keeping the old normalization,
1056     or taking care of the conflicting characters by operational means, or
1057     some other method.
1058  
1059     Implementations MUST NOT use more recent normalization tables than
1060     the one referenced from this document, even though more recent tables
1061     may be provided by operating systems.  If an application is unsure of
1062     which version of the normalization tables is in the operating
1071     system, the application needs to include the normalization tables
1072     itself.  Using normalization tables other than the one referenced
1073     from this specification could have security and operational
1074     implications.
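
As an illustration of pinning the normalization tables, CPython
bundles a Unicode 3.2.0 database object, unicodedata.ucd_3_2_0,
alongside its current tables.  The sketch below, with the
hypothetical helper name nfkc_3_2, normalizes with the pinned tables
regardless of which Unicode version the running interpreter otherwise
provides.

```python
import unicodedata

# Unicode 3.2.0 tables, independent of unicodedata.unidata_version.
PINNED = unicodedata.ucd_3_2_0


def nfkc_3_2(text: str) -> str:
    """Apply NFKC using the Unicode 3.2.0 tables only."""
    return PINNED.normalize("NFKC", text)


pinned_version = PINNED.unidata_version
```

An application that cannot obtain such a pinned table from its
platform would have to carry the table itself, as described above.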
1075  
1076     To help prevent confusion between characters that are visually
1077     similar, it is suggested that implementations provide visual
1078     indications where a domain name contains multiple scripts.  Such
1079     mechanisms can also be used to show when a name contains a mixture of
1080     simplified and traditional Chinese characters, or to distinguish zero
1081     and one from O and l.  DNS zone administrators may impose restrictions
1082     (subject to the limitations in section 2) that try to minimize
1083     homographs.
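
A rough way to flag labels that mix scripts is sketched below.  This
is a heuristic and an assumption of this example, not a mechanism
defined by this document: Python's standard library has no script
property, so the first word of each letter's Unicode character name
("LATIN", "CYRILLIC", "GREEK", ...) is used as a stand-in.  It cannot
tell simplified from traditional Chinese characters.  The function
names are hypothetical.

```python
import unicodedata


def script_tags(label: str) -> set:
    """Approximate the scripts in a label from Unicode character names."""
    tags = set()
    for ch in label:
        if ch.isalpha():  # digits and hyphen-minus carry no script
            tags.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return tags


def is_mixed_script(label: str) -> bool:
    """True if letters from more than one approximate script appear."""
    return len(script_tags(label)) > 1
```

For instance, a label in which U+0430 CYRILLIC SMALL LETTER A stands
in for Latin "a" is reported as mixed, and a user interface could
then mark the name visually.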
1084  
1085     Domain names (or portions of them) are sometimes compared against a
1086     set of privileged or anti-privileged domains.  In such situations it
1087     is especially important that the comparisons be done properly, as
1088     specified in section 3.1 requirement 4.  For labels already in ASCII
1089     form, the proper comparison reduces to the same case-insensitive
1090     ASCII comparison that has always been used for ASCII labels.
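
Such a comparison might be sketched as follows, using the standard
library function encodings.idna.ToASCII (which assumes
UseSTD3ASCIIRules is false).  The names labels_match and is_listed
are hypothetical, and the list of privileged labels is supplied by
the caller.

```python
from encodings.idna import ToASCII


def labels_match(a: str, b: str) -> bool:
    """Compare two labels by their ASCII forms, ignoring ASCII case."""
    return ToASCII(a).lower() == ToASCII(b).lower()


def is_listed(label: str, privileged: list) -> bool:
    """True if the label is equivalent to any entry in the list."""
    return any(labels_match(label, entry) for entry in privileged)
```

Comparing the raw input strings instead would treat "B\u00fccher" and
"xn--bcher-kva" as different labels and let one of them slip past the
list.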
1091  
1092     The introduction of IDNA means that any existing labels that start
1093     with the ACE prefix and would be altered by ToUnicode will
1094     automatically be ACE labels, and will be considered equivalent to
1095     non-ASCII labels, whether or not that was the intent of the zone
1096     administrator or registrant.
1097  
1098  11. IANA Considerations
1099  
1100     IANA has assigned the ACE prefix in consultation with the IESG.
1101  
1127  12. Authors' Addresses
1128  
1129     Patrik Faltstrom
1130     Cisco Systems
1131     Arstaangsvagen 31 J
1132     S-117 43 Stockholm  Sweden
1133  
1134     EMail: paf@cisco.com
1135  
1136  
1137     Paul Hoffman
1138     Internet Mail Consortium and VPN Consortium
1139     127 Segre Place
1140     Santa Cruz, CA  95060  USA
1141  
1142     EMail: phoffman@imc.org
1143  
1144  
1145     Adam M. Costello
1146     University of California, Berkeley
1147  
1148     URL: http://www.nicemice.net/amc/
1149  
1183  13. Full Copyright Statement
1184  
1185     Copyright (C) The Internet Society (2003).  All Rights Reserved.
1186  
1187     This document and translations of it may be copied and furnished to
1188     others, and derivative works that comment on or otherwise explain it
1189     or assist in its implementation may be prepared, copied, published
1190     and distributed, in whole or in part, without restriction of any
1191     kind, provided that the above copyright notice and this paragraph are
1192     included on all such copies and derivative works.  However, this
1193     document itself may not be modified in any way, such as by removing
1194     the copyright notice or references to the Internet Society or other
1195     Internet organizations, except as needed for the purpose of
1196     developing Internet standards in which case the procedures for
1197     copyrights defined in the Internet Standards process must be
1198     followed, or as required to translate it into languages other than
1199     English.
1200  
1201     The limited permissions granted above are perpetual and will not be
1202     revoked by the Internet Society or its successors or assigns.
1203  
1204     This document and the information contained herein is provided on an
1205     "AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING
1206     TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING
1207     BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION
1208     HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF
1209     MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.
1210  
1211  Acknowledgement
1212  
1213     Funding for the RFC Editor function is currently provided by the
1214     Internet Society.