Federal Communications Commission

March 8th, 2017 AT&T VoLTE 911 Outage Report and Recommendations

Public Safety Docket No. 17-68

A Report of the Public Safety and Homeland Security Bureau
Federal Communications Commission
May 2017

TABLE OF CONTENTS

Heading    Paragraph #
I. EXECUTIVE SUMMARY    1
II. BACKGROUND    4
III. FACTUAL FINDINGS ABOUT THE MARCH 8TH OUTAGE    7
IV. AT&T ACTIONS TO PREVENT RECURRENCE    25
V. NEXT STEPS    30
APPENDIX A: Illustration of AT&T’s 911 Network Architecture and Outage
APPENDIX B: Outage Remediation and PSAP Notification Timeline
APPENDIX C: Unique Users Impacted by State
APPENDIX D: List of Commenters and Ex Parte Notices

I. EXECUTIVE SUMMARY

1. On the afternoon of March 8th, 2017, nearly all AT&T Mobility (AT&T) 1 Voice over LTE customers across the nation lost 911 service for five hours. 2 Federal Communications Commission (Commission) Chairman Ajit Pai immediately directed the Public Safety and Homeland Security Bureau (Bureau) to investigate the causes, effects and implications of the outage. 3 In response, the Bureau reviewed and analyzed outage reports filed in its Network Outage Reporting System (NORS), 4 sought and reviewed public comments and related documents, and held meetings with relevant stakeholders, including service providers and public safety entities.
The Bureau also examined the record to identify ways to prevent future occurrences of such an outage. This report presents the Bureau’s findings.

2. As described in greater detail below, the outage was caused by an error that likely could have been avoided had AT&T implemented additional checks (e.g., followed certain network reliability best practices) with respect to its critical 911 network assets. Approximately 12,600 unique users attempted to call 911 but were unable to reach emergency services through the traditional 911 network. This was one of the largest 911 outages ever reported in NORS, as measured by the number of unique users affected.

3. Among the lessons learned from the March 8th outage is that when 911 service fails for any reason, Public Safety Answering Points (PSAPs) play a critical role in advising their jurisdictions of alternative ways to reach help. While AT&T and its subcontractors, Comtech and West, made efforts to notify thousands of PSAPs, the notifications were often unclear or missing important information, and generally took a few hours to occur. This outage also offers a case study of actions that stakeholders can take to promote network reliability and continued access to 911 service. For example, the March 8th outage emphasizes the importance of auditing all network assets critical to the provision of 911 service, and ensuring that such assets are safeguarded and designed to avoid single points of failure. The outage also demonstrates the need for closer coordination between industry and PSAPs, to improve overall situational awareness and ensure consumers understand how best to reach emergency services.

II. BACKGROUND

4. One of the Commission’s primary objectives is to “make available, so far as possible, to all people of the United States . . . a . . . wire and radio communication service . . . 
for the purpose of promoting safety of life and property.” 5 In furtherance of this objective, the Commission has taken measures to promote the reliable and continued availability of 911 telecommunications service. In 1997, the Commission adopted rules requiring Commercial Mobile Radio Service (CMRS) providers to implement 911 and Enhanced 911 services, and to “transmit all wireless 911 calls without respect to their call validation process to a Public Safety Answering Point.” 6

1 AT&T Mobility LLC is a wholly-owned subsidiary of AT&T that provides wireless services to 135 million subscribers in the United States. See AT&T Inc., Form 8-K, Current Report Pursuant to Section 13 or 15(d) of the Securities Exchange Act of 1934 (Jan. 25, 2017).
2 Voice over long-term evolution (Voice over LTE, or VoLTE) is a technology specification that defines the standards and procedures for delivering voice communication and data over 4G LTE networks.
3 See Press Release, FCC, FCC Chairman Ajit Pai Announces Investigation into Yesterday’s 911 Outage (March 9, 2017), https://apps.fcc.gov/edocs_public/attachmatch/DOC-343825A1.pdf.
4 NORS is the Commission’s web-based filing system through which communications providers covered by the Part 4 outage reporting rules must submit reports to the Commission. These reports are presumed confidential to protect sensitive and proprietary information about communications networks. See 47 CFR § 4.2.
5 The Communications Act of 1934 established the FCC, in part, “for the purpose of promoting safety of life and property through the use of wire and radio communication.” 47 U.S.C. § 151. Congress has repeatedly and specifically endorsed a role for the Commission in the nationwide implementation of advanced 911 capabilities. See Wireless Communications and Public Safety Act of 1999, PL 106–81, 113 Stat 1286 §§ 3(a), (b) (1999) (codified at 47 U.S.C. § 251(e)(3), 47 U.S.C. § 615) (directing the Commission to “designate 911 as the universal emergency telephone number within the United States for reporting an emergency to appropriate authorities and requesting assistance” and to “encourage and support efforts by States to deploy comprehensive end-to-end emergency communications infrastructure and programs, based on coordinated statewide plans, including seamless, ubiquitous, reliable wireless telecommunications networks and enhanced wireless 911 service.”); see also New and Emerging Technologies 911 Improvement Act of 2008 (NET 911 Act), PL 110–283, 122 Stat 2620 (2008) (codified at 47 U.S.C. § 615a-1(a), (c)(1)(B)) (requiring “each IP-enabled voice service provider to provide 9-1-1 service and enhanced 9-1-1 service to its subscribers in accordance with the requirements of the Federal Communications Commission”); Twenty-First Century Communications and Video Accessibility Act of 2010, PL 111-260, 124 Stat 2751 § 106(g) (2010) (CVAA) (codified at 47 U.S.C. § 615c(g)).
6 See Revision of the Commission’s Rules to Ensure Compatibility with Enhanced 911 Emergency Calling Systems, CC Docket No. 94-102, RM-8143, Memorandum Opinion and Order, 12 FCC Rcd 22665, 22744 (1997); Transition from TTY to Real-Time Text Technology; Petition for Rulemaking to Update the Commission's Rules for Access to Support the Transition from TTY to Real-Time Text Technology and Petition for Waiver of the Rules Requiring Support for TTY Technology, CG Docket No. 16-145, GN Docket No. 15-178, Report and Order and Further Notice of Proposed Rulemaking, 31 FCC Rcd 13568 (2016) (applying an analogous requirement to common carriers); see also 47 CFR § 20.18(b); 47 CFR § 64.3001.
7 See New Part 4 of the Commission’s Rules Concerning Disruptions to Communications, ET Docket No. 04-35, Report and Order and Further Notice of Proposed Rulemaking, 19 FCC Rcd 16830 (2004) (2004 Part 4 Report and Order); 47 CFR § 4.9.
8 47 CFR § 12.4(a)(4)(ii)(B) (defining an originating service provider); 47 CFR §§ 4.9(a), (c), (e), (f) (detailing parallel PSAP notification requirements for cable, satellite, wireless and wireline service providers); see also Improving 911 Reliability; Reliability and Continuity of Communications Networks, Including Broadband Technologies, PS Docket Nos. 13-75, 11-60, Report and Order, 28 FCC Rcd 17476, 17488-89, para. 36 (2013) (911 Reliability Order).

5. The Commission has adopted PSAP outage notification requirements where service outages could affect the delivery of 911 calls. In the 2004 Part 4 Report and Order, the Commission required “originating service providers” to notify PSAPs “as soon as possible” when they have experienced an outage that “potentially affects” a 911 special facility, and convey “all available information that may be useful to the management of the affected facility in mitigating the effects of the outage on callers to that facility.” 7 Originating service providers include cable communications providers, satellite operators, wireless service providers, and wireline communications providers, all of which are entities that offer the ability “to originate 911 calls.” 8 In the 2013 911 Reliability Order, the Commission adopted PSAP outage notification requirements for service providers that offer core 911 capabilities or deliver 911 calls and associated number or location information to the appropriate PSAP, defining them as “covered 911 service providers.” 9 The Commission required covered 911 service providers to notify 911 special
facilities of outages that potentially affect them within 30 minutes of discovering an outage. 10 The Commission further required that covered 911 service providers update PSAPs within two hours of their initial contact in order to communicate available information about the nature of the outage, its best-known cause, geographic scope, and the estimated time for repairs. 11 In its comments to this 2013 proceeding, APCO urged the Commission to extend these more specific PSAP notification rules to originating service providers as well, but the Commission declined to do so because covered 911 service providers “are the entities most likely to experience outages affecting 911 service,” and deferred the issue for future consideration. 12

9 See 47 CFR § 12.4(a)(4) (defining covered 911 service providers as entities that provide call routing, automatic location information (ALI), automatic number information (ANI), or the functional equivalent of those capabilities “directly to a public safety answering point” or appropriate local emergency authority, and can also include entities that operate one or more central offices that directly serve a PSAP); see also 911 Reliability Order, 28 FCC Rcd at 17490, para. 37 (stating that the Commission’s adopted definition of covered 911 service provider reflects that “while most current 911 networks rely on the infrastructure of an incumbent local exchange carrier (ILEC), no single type of entity will always provide 911 service in every community,” especially in light of the IP transition, and recognizing that “overbroad rules could inadvertently impose obligations on entities that provide peripheral support for NG911 but may not play a central role in ensuring 911 reliability or benefit as much as a typical circuit-switched ILEC from the best practices” integrated into the Commission’s 911 network reliability rules).
10 Compare 47 CFR § 4.9(h) (requiring covered 911 service providers to notify affected PSAPs “no later than 30 minutes from discovering the outage”) with 47 CFR § 4.9(e) (requiring originating service providers to notify affected PSAPs “as soon as possible”). The Commission’s PSAP notification requirements for covered 911 service providers are generally more specific than those that apply to originating service providers, requiring covered 911 service providers (as defined in 47 CFR § 12.4(a)(4)) to “convey all available information that may be useful in mitigating the effects of the outage, as well as a name, telephone number, and e-mail address at which the service provider can be reached for follow-up.” See 47 CFR § 4.9(h). Further, covered 911 service providers must “communicate additional material information to the affected 911 special facility as it becomes available, but no later than two hours after the initial contact,” including “the nature of the outage, its best-known cause, the geographic scope of the outage, the estimated time for repairs, and any other information that may be useful to the management of the affected facility.” See id. Finally, covered 911 service providers must notify PSAPs by telephone and in writing via electronic means in the absence of another method mutually agreed upon in advance by the 911 special facility and the covered 911 service provider. See id.
11 See id.
12 911 Reliability Order, 28 FCC Rcd at 17528-29, para. 147; see also Letter from Robert M. Gurss, Senior Regulatory Counsel, APCO International, to Marlene Dortch, Secretary, Federal Communications Commission, PS Docket Nos. 13-75, 11-60, at 1 (filed June 17, 2013) (arguing that “the definition of ‘911 service provider’ for purposes of outage notification requirements should be sufficiently broad to include any facilities or services involved in the initiation, transport, or delivery of a 911 call,” including wireline, wireless, and interconnected VoIP providers and transport systems associated with the delivery of call and caller information).
13 See 47 CFR §§ 12.4(b)-(c).
14 See 911 Reliability Order, 28 FCC Rcd at 17489-91, 17493-98, paras. 36-43, 48-65. The National Weather Service defines a derecho as “a widespread, long-lived wind storm that is associated with a band of rapidly moving showers or thunderstorms.” Robert H. Johns, Jeffry S. Evans, & Stephen F. Corfidi, About Derechos, NOAA-NWS-NCEP Storm Prediction Center (Nov. 7, 2012), http://www.spc.noaa.gov/misc/AbtDerechos/derechofacts.htm.

6. In addition to adopting PSAP outage notification requirements, the 911 Reliability Order also adopted 911 network reliability requirements for covered 911 service providers. 13 These requirements were based on best practices developed and recommended by the Commission’s federal advisory committee, the Communications Security, Reliability, and Interoperability Council (CSRIC), and were intended to address the network reliability problems that were brought to light by the 2012 “derecho” storm outages. 14 The Commission’s 911 reliability rules require covered 911 service providers to “certify annually whether they have, within the past year, audited the physical diversity of critical 911 circuits or equivalent data paths to each PSAP they serve, tagged those circuits to minimize the risk that
they will be reconfigured at some future date, and eliminated all single points of failure.” 15 In the alternative, the Commission permitted covered 911 service providers to describe “reasonably sufficient alternative measures they have taken to mitigate the risks associated with the lack of physical diversity.” 16 In 2014, the Commission proposed to revise these 911 reliability requirements to address failures that led to the 2014 multi-state outages, and proposed additional mechanisms designed to ensure that the Commission’s 911 governance structure kept pace with evolving technologies and new reliability challenges. 17

15 911 Reliability Order, 28 FCC Rcd at 17503, para. 80; see also 47 CFR § 12.4(c)(1). Regular circuit diversity audits are a CSRIC best practice. See CSRIC Best Practice 8-7-0532, https://www.fcc.gov/nors/outage/bestpractice/DetailedBestPractice.cfm?number=8-7-0532 (last visited Apr. 18, 2017). Diversity audits check for “single points of failure” in network configurations, while tagging ensures that changes to critical 911 assets cannot be made without rigorous review.
16 911 Reliability Order, 28 FCC Rcd at 17503, para. 80; 47 CFR § 12.4(b). This 2013 proceeding deferred for future consideration whether network reliability requirements should be extended to originating service providers. See 911 Reliability Order, 28 FCC Rcd at 17528-29, para. 147. The Commission took additional steps in 2016 to promote wireless resiliency by finding that the voluntary Wireless Network Resiliency Cooperative Framework “provides a rational basis for promoting an alternative path toward improved wireless resiliency without the need for relying on regulatory approaches.” See Improving the Resiliency of Mobile Wireless Communications Networks; Reliability and Continuity of Communications Networks, Including Broadband Technologies, PS Docket Nos. 13-239, 11-60, Order, 31 FCC Rcd 13745 (2016) (Mobile Wireless Resiliency Order). The voluntary framework approved in that order applies only to emergencies in which the FCC activates the Disaster Information Reporting System (DIRS). The Commission closed this Mobile Wireless Resiliency proceeding with this Order.
17 See generally 911 Governance and Accountability; Improving 911 Reliability, PS Docket Nos. 14-193, 13-75, Policy Statement and Notice of Proposed Rulemaking, 29 FCC Rcd 14208 (2014) (911 Governance NPRM) (examining methods to ensure end-to-end responsibility for the provision of 911 service). Among other measures, the 911 Governance NPRM sought comment on whether the Commission’s 911 network reliability provisions should apply to originating service providers, and on measures to improve PSAPs’ situational awareness during outages. See id.
18 See 47 CFR § 12.4(a)(4)(ii)(B).

III. FACTUAL FINDINGS ABOUT THE MARCH 8TH OUTAGE

7. Description of Normal 911 Call Processing in AT&T’s VoLTE Network. During an emergency, an individual should be able to dial “911” from anywhere in the Nation and be connected to the appropriate PSAP. AT&T provides this service, which entails significant call routing and processing, in its role as an originating service provider. 18 The call routing and processing steps for AT&T’s VoLTE network are described below.

1) An AT&T customer dials “911” on their mobile phone while on AT&T’s VoLTE network.
2) The caller is connected to a sector of a nearby LTE cell tower.
3) Upon recognizing the call as a 911 call, AT&T’s 911 network sends only the call data to one of its 911 call routing service subcontractors.
4) The subcontractor determines the appropriate PSAP to receive the 911 call based on the caller’s geographic location, and adds metadata to the call that will enable AT&T to route it to the appropriate PSAP.
5) The subcontractor returns the 911 call data, now with information regarding the appropriate PSAP to receive the 911 call, back to AT&T.
6) Based on this information, AT&T delivers the call to the local exchange carrier that serves the appropriate PSAP. 19
7) The local exchange carrier delivers the call to the appropriate PSAP and a 911 call-taker answers the phone.

8. Of particular relevance to this outage is the communications path between AT&T and its 911 call routing subcontractors, Comtech and West. 20 Comtech and West maintain call routing information for separate geographic regions for AT&T within the United States. AT&T decides whether to send the 911 call to Comtech or West (in step 3 described above) based on the caller’s geographic location by using a node called the Proxy Location Routing Function (PLRF). This node determines whether Comtech or West serves the geographic area from which the call originated by using information about the caller’s cell site sector. AT&T sends the call data to one of two gateways that Comtech and West can access. These gateways, known as Session Border Controllers, control access between AT&T’s network and external networks. 21

9. When Comtech or West returns the supplemented 911 call data to AT&T’s 911 network in step 5, the Session Border Controllers perform a check to make sure that the incoming traffic originates from a predetermined set of IP addresses that AT&T’s 911 live network is programmed to trust. This list of trusted IP addresses is called a “whitelist.” This policy protects AT&T’s 911 network from unintentional or malicious traffic. AT&T maintains a record of whitelisted IP addresses in a customer provisioning system. A technical illustration of AT&T’s 911 architecture, as well as how this outage occurred, is provided as Appendix A. 22

10. Root Causes of the Outage. The failures that caused this outage occurred entirely within AT&T’s network.
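The whitelist check described in paragraph 9 can be illustrated with a short sketch. The Python below is a minimal illustration and not AT&T’s implementation: the function names, the use of provider names as dictionary keys, and the IP addresses (taken from the RFC 5737 documentation ranges) are all assumptions made for the example. It shows a gateway accepting returned call data only from trusted source addresses, and a comparison between a whitelist record held in a provisioning system and the list enforced on the live network. Because these are separate copies, they can diverge, and a periodic comparison is one way such a divergence might be detected.

```python
import ipaddress

# Hypothetical live whitelist: trusted source addresses per routing subcontractor.
# Addresses come from the RFC 5737 documentation ranges, not from the report.
LIVE_WHITELIST = {
    "Comtech": {"192.0.2.10", "192.0.2.11", "192.0.2.12"},
    "West": {"198.51.100.20", "198.51.100.21"},
}


def sbc_accepts(source_ip, whitelist):
    """Return True if call data arriving from source_ip comes from a trusted address."""
    addr = ipaddress.ip_address(source_ip)  # raises ValueError on malformed input
    return any(
        addr == ipaddress.ip_address(trusted)
        for trusted_set in whitelist.values()
        for trusted in trusted_set
    )


def whitelist_drift(record, live):
    """Report, per provider, how a provisioning record differs from the live whitelist."""
    drift = {}
    for provider in sorted(set(record) | set(live)):
        recorded = record.get(provider, set())
        enforced = live.get(provider, set())
        if recorded != enforced:
            drift[provider] = {
                "missing_from_record": enforced - recorded,
                "only_in_record": recorded - enforced,
            }
    return drift


if __name__ == "__main__":
    # Hypothetical provisioning record that lists fewer Comtech addresses than the live list.
    record = {"Comtech": {"192.0.2.10"}, "West": set(LIVE_WHITELIST["West"])}
    print(sbc_accepts("192.0.2.11", LIVE_WHITELIST))
    print(sbc_accepts("192.0.2.11", record))
    print(whitelist_drift(record, LIVE_WHITELIST))
```

In this sketch, a non-empty result from `whitelist_drift` signals that pushing the record to the live network would change which sources are trusted, which is the kind of condition a pre-change review could examine.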
As outlined above, AT&T maintains connections with Comtech and West to obtain 911 call routing information. The connections between AT&T and Comtech and between AT&T and West are critical to 911 call routing because connectivity to Comtech and West enables AT&T to access PSAP call routing information.

19 See infra Appendix A (Illustration of AT&T’s 911 Network Architecture and Outage).
20 Comtech Telecommunications Corporation (Comtech) (formerly TCS) is a provider of 911 and emergency communications infrastructure, systems and services to telecommunications service providers and public safety agencies throughout the United States. See Comtech Telecommunications Corp., Form 8-K, Current Report Pursuant to Section 13 or 15(d) of the Securities Exchange Act of 1934 (Mar. 8, 2017). West Safety Services, Inc. (West) (formerly Intrado Inc.), a wholly-owned subsidiary of West Corporation, provides emergency communications services and infrastructure systems and services to communications service providers and public safety organizations throughout the United States. See West Corporation, Form 10-K, Annual Report Pursuant to Section 13 or 15(d) of the Securities Exchange Act of 1934 (Feb. 16, 2017). West and Comtech are the two providers that offer location routing service for AT&T VoLTE calls. Comtech and West each maintain two geographically diverse Gateway Mobile Location Centers (GMLCs). GMLCs insert the Emergency Services Routing Key (ESRK) into 911 call data, allowing the call to be routed to the appropriate PSAP.
21 See infra Appendix A (Illustration of AT&T’s 911 Network Architecture and Outage) (illustrating these gateways as “SBCs”).
22 See infra Appendix A (Illustration of AT&T’s 911 Network Architecture and Outage).

11. Sometime prior to March 8th, AT&T placed an incorrect record of whitelisted IP addresses into its customer provisioning system, which contains records of AT&T’s network inventory. 23 Specifically, the incorrect record did not contain the appropriate IP addresses for Comtech. Although AT&T retains log files for its customer provisioning system for 90 days, it has not been able to determine when or why this incorrect record was placed into the system. AT&T also did not detect the mismatch between the whitelist in the customer provisioning system and the whitelist on the live network through routine inventory management. Nonetheless, because errors in customer provisioning system records, in themselves, do not affect the live network, communications between AT&T and Comtech were unaffected at that point.

12. On March 8th, AT&T unintentionally broke its connection to Comtech. While working on an unrelated project, AT&T initiated a network change that pushed the record containing the incorrect whitelist onto AT&T’s live network. With Comtech’s IP addresses no longer included on the whitelist, the connection with Comtech was broken, disrupting the flow to AT&T’s network of information identifying the appropriate PSAP to receive certain 911 calls. 24 Notably, AT&T was able to make this network change without extensive testing, and during peak 911 traffic hours, because the connections to the Session Border Controllers that maintained the whitelist were tagged as “customer” assets. Assets tagged as “infrastructure,” in contrast, are updated separately, only after rigorous failure testing, and during specified off-peak maintenance periods.

13. Once the loss of connectivity between AT&T and Comtech left both of AT&T’s Session Border Controllers unable to receive routing information from Comtech, the Session Border Controllers began to generate error messages along the paths between themselves and the PLRF. These errors triggered critical 911 alarms to AT&T’s 911 troubleshooting team as early as sixteen minutes after the outage began.
25 AT&T notified its internal troubleshooting teams serially, starting with the 911 team, then the VoLTE team, then the Universal Service Platform team responsible for AT&T’s VoLTE 911 network as a whole, then the Core Backbone team, all before the IP team.

23 A customer provisioning system contains records of a service provider’s network inventory, which are assigned in the network as part of the service provisioning process. The live network refers to the actual assets in use in a service network at a given point in time.
24 Comtech communicates with AT&T using many pre-approved IP addresses, but AT&T’s customer provisioning system database contained only one. When AT&T replaced the IP address whitelist for Comtech with this single entry, there was no longer a perfect match between the IP addresses from which Comtech was sending supplemented 911 call data to AT&T and the IP addresses from which AT&T’s network expected to receive it, so data from Comtech was rejected.
25 AT&T maintains distinct internal troubleshooting teams for each major network element. Each internal troubleshooting team is organized into tiers, with more skilled technicians assigned to higher-numbered troubleshooting tiers. Each troubleshooting team has the independent capability to escalate an issue to a higher tier or to another team, as it deems appropriate.
26 This process of turning apparently malfunctioning links off and then back on (rebooting them) is designed to prevent the PLRF from continuing to look for call routing information from a non-functioning Session Border Controller when call data could be supplied via the alternate Session Border Controller.

14. When the PLRF received error messages from the Session Border Controllers that surpassed a certain density threshold, the PLRF responded, as programmed, by performing a soft reset on the links between itself and the Session Border Controllers. 26 Comtech and West both transmitted 911 call data to AT&T along each of these paths, so AT&T could not receive transmissions from either Comtech or West while both links were turned off. 27 Once the links came back online, call processing resumed for West, only to stop again when the PLRF performed another soft reset on the links in response to a new flood of error messages, because the whitelist was still incorrect.

15. Where AT&T failed to receive appropriate PSAP call routing information from Comtech or West for a given 911 call, AT&T routed that 911 call to the Emergency Call Relay Center, a backup call center staffed with professional call takers that could manually route the calls to the appropriate PSAP by soliciting location information from the caller. 28 The backup call center was not intended to address a nationwide outage and could not handle all of this additional traffic. 29 As a result, it dropped the overwhelming majority of calls that it received.

16. Almost five hours after the outage began, AT&T’s IP Troubleshooting team discovered that a network change from its customer provisioning system coincided with the start time of the outage. The IP Troubleshooting team requested a system rollback, which occurred three minutes later, ending the outage. A timeline of AT&T’s attempts to remediate this outage is provided in Appendix B. 30

17. Network Impacts. The result was a nationwide 911 outage on AT&T’s VoLTE network lasting for five hours and one minute. The Bureau’s investigation indicates that the outage affected AT&T’s VoLTE wireless customers in 49 states, the District of Columbia, Puerto Rico, and the Virgin Islands. 31 AT&T’s normal VoLTE call processing was not otherwise affected.
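The soft-reset behavior described in paragraph 14 can be sketched as a small simulation. This is a simplified illustration under stated assumptions, not a model of AT&T’s PLRF: the tick-based clock, the count-based error threshold (the report refers to a density threshold without giving values), the reset duration, and the function and parameter names are all hypothetical. It shows how resetting links that carry traffic for both subcontractors can make West’s replies intermittently undeliverable even though only Comtech’s replies are being rejected.

```python
def simulate_west_delivery(ticks, error_threshold=3, reset_ticks=2, shared_links=True):
    """Return one boolean per tick: can West's routing reply be delivered?

    Assumptions for this sketch: on every tick in which the links are up, one
    Comtech reply is rejected by the whitelist and produces one error. When the
    error count reaches error_threshold, the links are soft reset and stay down
    for reset_ticks ticks. With shared_links=True, West's replies travel over
    the same links and are blocked during a reset.
    """
    errors = 0
    down_remaining = 0
    west_delivered = []
    for _ in range(ticks):
        if down_remaining > 0:
            # Links are resetting; West is blocked only if it shares them.
            west_delivered.append(not shared_links)
            down_remaining -= 1
            continue
        west_delivered.append(True)  # links are up, so West's reply gets through
        errors += 1                  # the rejected Comtech reply generates an error
        if errors >= error_threshold:
            errors = 0
            down_remaining = reset_ticks  # soft reset triggered
    return west_delivered


if __name__ == "__main__":
    print(simulate_west_delivery(10))
    print(simulate_west_delivery(10, shared_links=False))
```

With the default values, the shared-link run alternates between three delivered ticks and two blocked ticks, while the hypothetical separate-path run (`shared_links=False`) delivers West’s replies on every tick.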
Some localities reported not being affected by the outage, but this may have been due to PSAPs’ inability to detect outages occurring in service provider networks. AT&T reports that approximately 12,600 unique callers were not able to reach 911 directly during the outage. 32 AT&T acknowledges that “[b]ecause the outage was widespread geographically, thousands of PSAPs were potentially affected.” 33

27 There was no independent failure in either Comtech’s or West’s networks.
28 See infra Appendix A (Illustration of AT&T’s 911 Network Architecture and Outage) (referring to this backup call center as the ECRC).
29 On a typical day, nearly 100 percent of calls are routed to the proper PSAP automatically, and the backup call center does not need to be engaged. To the extent that it does need to be engaged, the backup call center is designed only to handle a small fraction of calls, which (for various causes) may not route properly to the PSAP. In contrast, however, in order to be prepared to handle a nationwide outage, AT&T would have needed to maintain backup call routing sufficient to simulate the manual call-taking processes of all 6,386 Primary PSAPs nationwide. See FCC, 911 Master PSAP Registry, https://www.fcc.gov/general/9-1-1-master-psap-registry (last visited Apr. 26, 2017).
30 See infra Appendix B (Outage Remediation and PSAP Notification Timeline).
31 A list of the number of unique users and states affected by the outage is included as Appendix C. See infra Appendix C (Unique Users Impacted by State).
32 See AT&T, Final NORS Report (Apr. 11, 2017). A small subset of these calls were completed after being rerouted to the Emergency Call Relay Center, until that backup call center became overloaded.
33 AT&T Services, Inc. Comments, PS Docket No. 17-68, at 4 (filed April 7, 2017).

18. The 911 VoLTE outage did not affect service on AT&T’s 3G network or text-to-911 messaging functions over its 4G LTE network. VoLTE 911 calls in regions of the United States that ordinarily would have been routed with support from Comtech’s service could not be completed. Furthermore, although the whitelist errors only directly impacted Comtech, both West and Comtech were affected because AT&T did not maintain separate logical paths for Comtech and West between the PLRF and the Session Border Controller. 34 Calls from the remainder of the country that ordinarily would have been routed with support from West’s service were unable to be completed while the links were turned off, even though there was no independent failure in West’s network. During the intervals when these links were turned back on, VoLTE 911 calls that were directed to West for routing information were able to complete as normal. As the outage persisted, the links continued to flap on and off, causing VoLTE 911 calls supported by West to cycle between working and non-working states.

34 Logical diversity, sometimes called equipment diversity, means that two circuits are provisioned to use different transmission equipment, but could share the same transmission medium (for example, the same fiber or conduit). See 911 Reliability Order, 28 FCC Rcd at 17504, para. 83 (providing examples of logical diversity as contrasted with physical diversity).
35 Some public safety entities report a preference for notification via phone, rather than e-mail, during an outage. See, e.g., Letter from Julie Righter Dove, PSAP Official, Lincoln/Lancaster, Nebraska 911, to Federal Communications Commission, PS Docket No. 17-68, at 1 (filed Apr. 19, 2017) (Lincoln/Lancaster Nebraska 911 Ex Parte Letter) (stating that email is not monitored with the same priority as phone calls). Others consider e-mail notification to be acceptable, so long as it is “comprehensive and detailed.” See Letter from Tanessa Cabe, Telecommunications Counsel, New York City Information Technology and Telecommunication, to Marlene Dortch, Secretary, Federal Communications Commission, PS Docket No. 17-68, at 1 (filed Mar. 31, 2017) (NYC ITT Ex Parte Letter) (stating that while e-mail notification is acceptable, e-mails should be “comprehensive and detailed” and “other forms of notification such as phone calls” are recommended “as a backup depending on the type of outage”).
36 See infra Appendix B (Outage Remediation and PSAP Notification Timeline).
37 AT&T Comments at 4. A timeline illustrating AT&T’s discovery and efforts to remediate this outage, as well as its efforts to notify PSAPs, is included as Appendix B. See infra Appendix B (Outage Remediation and PSAP Notification Timeline).
38 Comtech Comments at 3.

19. Notifications to PSAPs. Most, but apparently not all, PSAPs received word of the outage affecting AT&T customers from a variety of sources, including direct notification from AT&T, Comtech, and West. PSAPs received notification by both phone and e-mail. 35 The first notice to a PSAP, sent by AT&T, occurred approximately 3½ hours after the outage started, approximately 2½ hours after AT&T sent internal mass notifications to company executives and senior staff about the event, and approximately 2 hours after Comtech learned, in conversation with AT&T, that no calls to 911 were getting through. 36 Specifically, AT&T began notifying a handful of PSAPs at 19:26 CST, over three and a half hours after the outage started, via phone and e-mail. 37 At 19:58 CST, AT&T sent an e-mail communication to all of the approximately 3,800 PSAPs served by AT&T Wireline services. At 20:11 CST, Comtech sent notifications informing over 5,300 PSAPs nationwide of the outage and its resolution. 38 At 20:25 CST, West sent notification e-mails to all of the approximately 4,784 wireless PSAPs in its database, and it sent a follow-up notification of the outage’s resolution approximately an hour later.
39 At least one affected PSAP in Nebraska reported receiving no notification of the outage from any service provider. 40 A timeline of PSAP notifications provided by AT&T is included as Appendix B. 41 20. Affected PSAPs further report that when notifications occurred, they contained very little useful information about the extent or nature of the outage. For example, Minnesota PSAPs report that initial notification e-mails from Comtech were “ambiguous,” simply stating that a “potential impairment” could impact wireless 911 calls in the area. 42 Minnesota PSAPs found this notification confusing, particularly because they were still receiving 911 calls from AT&T customers at that time. 43 AT&T should have known that the outage was limited to their VoLTE service once they discovered the network error because the error only affected their 911 VoLTE infrastructure, but, according to AT&T, during the time in question, the focus was on restoring service rather than on determining the extent of the outage. In any case, this information was not conveyed to PSAPs. Comtech’s notification to Colorado PSAPs indicated that the outage was limited to 911 VoLTE calling, but included no additional information about the outage’s cause, scope, or geographic impact. 44 The Washington, D.C. PSAP similarly reports that notification from West “was very broad and did not give a geographical scope of the outage.” 45 The notifications did not include an estimated time for repairs. Some PSAPs report that they reached out directly to AT&T in order to clarify the scope and cause of the outage, but not all were successful. 
46 Public safety entities indicate that initial notification from originating service providers should apprise PSAPs of the network elements and geographic locations affected by the outage, as well as its expected duration. 47 This would provide situational awareness to PSAPs so that they can communicate with the public more effectively. 48

39 See Letter from Daryl Branson, Senior 911 Telecom Analyst, Colorado Public Utilities Commission, to Marlene Dortch, Secretary, Federal Communications Commission, PS Docket No. 17-68, at 8 (filed April 3, 2017) (Colorado PUC Ex Parte Letter); Letter from John Haynes, Deputy Director for 9-1-1, Department of Emergency Services, The County of Chester, to Marlene Dortch, Secretary, Federal Communications Commission, PS Docket No. 17-68, at 1 (filed April 6, 2017) (Chester County, PA Ex Parte Letter). Some jurisdictions separate the calls according to their originating platform and deliver them to separate PSAPs. Wireless PSAPs are PSAPs to which wireless 911 calls are forwarded.

40 See Lincoln/Lancaster Nebraska 911 Ex Parte Letter at 1; NYC ITT Ex Parte Letter at 1 (“The PSAC was not contacted by the carrier or any other state or federal entity regarding the incidents. The City became aware of the outage through press outlets.”); cf. AT&T Comments at 4 (“Based on the FCC Interim Report and various media accounts, we believe that many local governments received the notice needed to timely communicate the outage and alternate localized emergency contact information to the residents of their areas.”) citing Presentation of Lisa M. Fowlkes, Acting Bureau Chief, Public Safety and Homeland Security Bureau, FCC, March 8th AT&T Mobility VoLTE 911 Outage Preliminary Report (Mar. 23, 2017) (FCC Interim Report).

41 See infra Appendix B (Outage Remediation and PSAP Notification Timeline).

42 Letter from Dana Wahlberg, State of Minnesota 9-1-1 Program Manager, to Marlene Dortch, Secretary, Federal Communications Commission, PS Docket No. 17-68, at 1 (filed April 20, 2017) (State of Minnesota Ex Parte Letter).

43 Minnesota Department of Public Safety Ex Parte Letter at 1. The calls Minnesota PSAPs received were likely from AT&T callers using legacy networks, but they did not receive sufficient information in the notification to glean this.

44 Colorado PUC Ex Parte Letter at 21.

45 Letter from Karima Holmes, Director, Office of Unified Communications, Washington, DC, to Federal Communications Commission, PS Docket No. 17-68, at 1 (filed Mar. 31, 2017) (Washington, DC OUC Ex Parte Letter).

46 See Washington, DC OUC Ex Parte Letter at 1; see also Letter from Teresa Jacobs, Mayor, Orange County, Florida, to Ajit Pai, Chairman, Federal Communications Commission (Mar. 10, 2017) (on file with author).

21. AT&T indicates that both the large geographic scope and the unique circumstances of the March 8th outage impacted the timing and extent of PSAP notifications. AT&T was unaware of the extent of the outage until several hours after it began, and initially believed that the outage was located in, and limited to, 911 calls requiring Comtech’s support. In addition, because the outage was intermittent for the PSAPs served primarily with support from West and because some calls were able to get through via the backup Emergency Call Relay Center, the number of PSAPs impacted by the outage was not immediately clear.

22. Notification from affected service providers notwithstanding, PSAPs across the country used a variety of methods to determine whether they were affected by the outage, and if so, the outage’s scope. Many PSAPs – including PSAPs in Colorado and Washington, D.C. – first became aware of the outage through contact with other affected PSAPs or posts on social media.
49 A number of public safety entities made comparisons to historical PSAP call data to determine that an outage was occurring, and made test calls from a variety of communications service providers’ mobile devices to determine that an outage was impacting AT&T’s VoLTE network. 50 PSAPs that support text-to-911 also reported sending test texts and determined that text-to-911 capability remained in service for AT&T’s VoLTE customers during the outage. 51 These resource-intensive efforts could have been obviated by timely and effective notification from affected service providers.

23. PSAPs affected by the outage took steps to notify the public of alternative methods to reach emergency services. For example, PSAPs notified the public of alternative 10-digit emergency numbers that they could use in an emergency while 911 was unavailable for AT&T’s VoLTE customers. 52 APCO reports that “PSAPs and 9-1-1 authorities largely utilized social media to spread awareness and share information about the outage.” 53 PSAPs in Chester County, Pennsylvania and the Washington, D.C. PSAP also requested that local media run an on-screen text crawl about the outage, and used mass notification tools to alert registered individuals. 54 Additionally, public safety officials in Orange County, Florida held a press conference to notify the public of the outage. PSAPs report that this outreach was successful. For example, representatives from Orange County, Florida reported that they received 172 calls to an alternative 10-digit emergency phone number in the hour and a half after they released it, far exceeding normal call volume.

47 APCO Ex Parte Letter at 1 (“PSAPs need to know where and when the outage occurred, the nature of the outage, and expected repair time.”); NYC ITT Ex Parte Letter at 1 (stating that notifications should include the “scope, type of event, impact, severity, granular geographic location by census tract, expected resolution time, and any other information about the outage that would be particular to New York City.”).

48 See Letter from Richard Taylor, Executive Director, North Carolina 911 Board, to Federal Communications Commission, PS Docket No. 17-68, at 2 (filed Apr. 21, 2017) (NC 911 Board Ex Parte Letter at 1) (stating that information about an outage’s network scope, geographic scope, and estimated time of remediation helps PSAPs to decide when and how to notify the public).

49 See Washington, DC OUC Ex Parte Letter at 1; Colorado PUC Ex Parte Letter at 3; Letter from Jeffrey S. Cohen, Chief Counsel, APCO International, to Marlene Dortch, Secretary, Federal Communications Commission, PS Docket No. 17-68, at 1 (filed on April 10, 2017) (APCO Ex Parte Letter). A NASNA e-mail chain at 8:50 CST alerted PSAPs across the country to the possibility of an AT&T service outage in their area, before many PSAPs had received initial notification from any service provider. See Washington, DC OUC Ex Parte Letter at 1.

50 See, e.g., Colorado PUC Ex Parte Letter at 3 (reporting that Colorado PSAPs began testing calls from AT&T devices after they received reports of an AT&T outage through an e-mail listserv indicating that at least some PSAPs in the state were unable to receive 911 VoLTE calls from AT&T devices, while others appeared to be unaffected).

51 See Colorado PUC Ex Parte Letter at 9.

52 See, e.g., Washington, DC OUC Ex Parte Letter at 1; Letter from Teresa Jacobs, Mayor, Orange County, Florida, to Ajit Pai, Chairman, Federal Communications Commission (Mar. 10, 2017) (on file with author).

24. Public Impact. During the outage, approximately 12,600 unique users attempted to call 911, but were unable to reach emergency services through the traditional 911 network. AT&T customers reportedly heard either fast busy signals, endless ringing or silence when they called 911.
55 The mayor of Orange County, Florida reports that one AT&T customer experiencing a medical emergency was unable to reach emergency services via his mobile device. 56 The customer was only able to reach the Orlando Fire Department through a home security system. 57 Motorists involved in a traffic accident in Orange County, Florida were also unable to reach 911 from their AT&T devices. 58 These examples highlight the critical importance of uninterrupted public access to emergency services and the reliability of 911 networks nationwide. Other localities affected by the outage did not report receiving public complaints. 59

IV. AT&T ACTIONS TO PREVENT RECURRENCE

25. AT&T states that it has taken four major steps to prevent the recurrence of a similar 911 outage, and to improve early 911 outage detection and mitigation. First, AT&T no longer treats Session Border Controller connections between itself and its 911 call routing subcontractors as “customer” assets. Instead, AT&T now treats them as “infrastructure” assets. Changes to infrastructure assets must go through a more rigorous and careful testing process than changes to customer assets before being implemented in the live network. Had AT&T used this approach before the March 8th outage, it would likely have noticed the incorrect IP address assignment during the testing process, before it was implemented in the field.

26. Second, AT&T has made changes to its internal alarm system to make sure that the errors generated in conditions similar to the March 8th outage are received immediately and concurrently by its 911 troubleshooting team, its VoLTE troubleshooting team, and its IP team.
AT&T engaged its troubleshooting teams serially, and not all teams with expertise relevant to resolving the outage were immediately notified of its occurrence. The outage could have been resolved sooner had all troubleshooting teams been involved from first alarm.

53 See APCO Ex Parte Letter at 1; see also Colorado PUC Ex Parte Letter at 4 (stating that they used Twitter and other social media for public notification); Chester County, PA Ex Parte Letter at 1; Washington DC OUC Ex Parte Letter at 1 (stating that they used the mass notification system, AlertDC).

54 See Chester County, PA Ex Parte Letter at 1; Washington, DC OUC Ex Parte Letter at 1.

55 Colorado PUC Ex Parte Letter at 8.

56 See Letter from Teresa Jacobs, Mayor, Orange County, Florida, to Ajit Pai, Chairman, Federal Communications Commission (Mar. 10, 2017) (on file with author).

57 See id.

58 See id.

59 See, e.g., Chester County, PA Ex Parte Letter at 1 (stating that they received no public complaints); NC 911 Board Ex Parte Letter at 1 (stating that he is not aware of any negative consequences in North Carolina due to the March 8th outage and received no public feedback).

27. Third, AT&T has bifurcated the links that connect the Session Border Controllers to the PLRF. This provides Comtech and West with separate logical communications paths. Had this bifurcation been in place on March 8th, the outage would have only affected 911 calls processed by Comtech and would not have affected 911 calls processed by West. This change reduces the likelihood that a future network issue encountered by one 911 call routing information provider will impact call processing attempted by the other.

28. Fourth, AT&T has implemented a manual process to drop VoLTE service and fall back to 3G for 911 calls during VoLTE 911 outages. 60 During an unrelated AT&T VoLTE outage that occurred on March 11, 2017, AT&T was able to successfully deliver most 911 VoLTE calls to appropriate PSAPs.
61 The nature of the event caused some VoLTE customers to not be able to register on the AT&T VoLTE network, but AT&T was able to use an automated process to register some of them on their 3G network instead. This fallback mechanism did not work on March 8th because the network issue that caused the outage occurred further along in the call setup path. Had the manual mechanism that AT&T has now implemented been available in the circumstances of the March 8th outage, it could have mitigated the outage as successfully as the automated process did during the unrelated AT&T VoLTE outage on March 11th.

29. The Bureau anticipates that these voluntary changes will help AT&T to prevent a recurrence of a similar 911 outage and may help AT&T with future 911 outage detection and remediation.

V. NEXT STEPS

30. The Commission has been unwavering in its commitment to ensuring continued access to 911 service. Commencing the investigation of the March 8th, 2017 VoLTE 911 outage and following through with this report is a demonstration of that commitment. But there is more to do.

31. This outage offers an illuminating case study of actions that stakeholders can take to promote network reliability and continued access to 911 service. For example, based on the Bureau’s analysis of the March 8, 2017 AT&T VoLTE 911 outage, CSRIC’s recommended network reliability best practices could have prevented this outage or mitigated its impact. Specifically, CSRIC recommended that network operators should establish processes for verifying that changes to network configurations minimize the possibility of call processing errors 62 and that network operators periodically audit their logical networks for diversity. 63 Had AT&T followed these best practices, it could have prevented this outage or mitigated its impact.

60 According to AT&T, an automated process would not work in this instance because of the nature of the network connectivity issue, and because of the location in AT&T’s 911 network in which the error occurred.

61 The Bureau is currently in the process of investigating the March 11th, 2017 outage. The Bureau also notes that AT&T experienced another VoLTE 911 outage on May 1st, 2017. The Bureau’s preliminary research indicates that these outages were unrelated and attributable to different causes than the March 8th, 2017 outage. The Bureau will produce separate case studies on its findings.

62 See CSRIC Best Practice 9-9-8729, https://www.fcc.gov/nors/outage/bestpractice/DetailedBestPractice.cfm?number=9-9-8729 (last visited May 12, 2017).

63 See CSRIC Best Practice 8-7-0532, https://www.fcc.gov/nors/outage/bestpractice/DetailedBestPractice.cfm?number=8-7-0532 (last visited Apr. 18, 2017).

32. The Bureau plans to engage in stakeholder outreach and guidance regarding CSRIC’s recommended network reliability best practices to protect against similar outages in the future. In particular, the Bureau plans to release a Public Notice reminding companies of best practices and their importance. The Bureau will also be contacting other major VoLTE providers to discuss their network practices, and will offer its assistance to smaller VoLTE providers.

33. This outage also highlights the need for close working coordination between industry and PSAPs to improve overall situational awareness and ensure consumers understand how best to reach emergency services. In particular, there is a need for further industry coordination and discussion surrounding the processes and roles that stakeholders play for informing consumers about how to continue to reach 911 during an outage. The Bureau can help to foster this kind of coordination and guidance. In this regard, the Bureau plans to conduct stakeholder outreach to help promote better understanding of 911 outage notification best practices.
The Bureau will convene consumer groups, public safety entities and service providers in the 911 ecosystem to participate in a workshop in order to discuss best practices and develop recommendations for improving situational awareness during 911 outages, including strengthening PSAP outage notifications and how to best communicate with consumers about alternative methods of accessing emergency services.

APPENDIX A
Illustration of AT&T’s 911 Architecture and Outage

Glossary

EPC – Evolved Packet Core: A framework which combines voice and data on a 4G LTE network.

SBC – Session Border Controller: A device that authenticates, validates and controls traffic from other network elements.

E-CSCF – Emergency Call Session Control Function: The primary network controller responsible for managing 911 VoLTE calls.

PLRF – Proxy Location Retrieval Function: A device that determines whether 911 call data should be directed to Comtech or West for processing.

VPN – Virtual Private Network: A method of providing secure, encrypted access to remote devices.

GMLC – Gateway Mobile Location Center: A control system that retrieves and provides location information of wireless devices. It has a database that indexes cell sector and PSAP location information to support emergency call routing.

ESRK – Emergency Services Routing Key: Metadata that is used to direct the call to the appropriate PSAP.
ECRC – Emergency Call Relay Center: A backup call center staffed with professional call takers that could manually route the calls to the appropriate PSAP by soliciting location information from the caller.

APPENDIX B: Timeline of Outage Remediation and PSAP Notification

| TIME (CST) | EVENT DESCRIPTION | TIME ELAPSED |
|---|---|---|
| 15:52 | Outage begins after change request initiated by customer provisioning system replaced existing route map prefix set | 0 mins |
| 16:03 | Critical alarm tickets auto-created over PLRF-SBC link | 11 mins |
| 16:08 | AT&T 911 Tier 1 Troubleshooting Team acknowledges the alarm tickets | 16 mins |
| 16:17 | AT&T 911 Tier 2 Troubleshooting Team engaged and investigating alarms | 25 mins |
| 16:27 | AT&T 911 Tier 3 Troubleshooting Team engages | 35 mins |
| 16:34 | AT&T’s internal operations communications center is notified for the purpose of providing internal communications related to this outage | 42 mins |
| 16:54 | AT&T 911 Tier 3 Troubleshooting Team engages PLRF external vendor (node that generated alarm) | 1 hr, 2 mins |
| 17:05 | 911 Tier 2 Troubleshooting Team contacts Comtech NOC, and learns no 911 calls are connecting | 1 hr, 13 mins |
| 17:33 – 18:40 | VoLTE Troubleshooting teams engage to assist; perform a soft reset on the links between the PLRF and the SBCs with no success | 1 hr, 41 mins – 2 hrs, 48 mins |
| 19:03 – 20:30 | VoLTE Tier 3 Troubleshooting Team coordinates with Comtech and CBB troubleshooting teams to identify that there may be a routing issue preventing Comtech’s traffic from being received by AT&T, although AT&T’s traffic is getting through to Comtech | 3 hrs, 11 mins – 4 hrs, 38 mins |
| 19:26 – 20:39 | AT&T PSAP Relations communicates with Tarrant County, Texas; Washington, DC; Arizona; California; Oregon; Michigan; Las Vegas, Nevada 64 | 3 hrs, 34 mins – 4 hrs, 47 mins |
| 19:58 | AT&T sends e-mail notification to all AT&T Wireline PSAPs (~3,800) | 4 hrs, 6 mins |
| 20:11 | Comtech notifies all PSAPs in its database (~5,300) using an e-mail listserv | 4 hrs, 19 mins |
| 20:20 – 20:45 | AT&T’s IP Troubleshooting team traces 911 call IP packet routing through a peering router, an unintended path. | 4 hrs, 28 mins – 4 hrs, 53 mins |
| 20:25 | Upon AT&T request, West notifies all Primary wireless PSAPs in its database (~4,784) | 4 hrs, 33 mins |
| 20:50 | AT&T IP Troubleshooting team discovers network change with the same start time as the outage, IP team requests system rollback | 4 hrs, 58 mins |
| 20:53 | Rollback completed. Service restored. | 5 hrs, 1 min |
| 21:14 | Comtech sends notification that outage has been resolved to all PSAPs in its database (~5,300) using an email listserv | 5 hrs, 22 mins |
| 21:39 | Upon AT&T request, West sends notification that the outage has been resolved. | 5 hrs, 37 mins |

64 These PSAPs either contacted AT&T during the outage or had previously requested that AT&T notify them of mobility 911 outages.

APPENDIX C
Unique Users Impacted by State

The table below reflects AT&T’s quantification of the number of unique users affected by the March 8th, 2017 AT&T Outage.

| State | Unique Users Impacted |
|---|---|
| AK | 43 |
| AL | 213 |
| AR | 240 |
| AZ | 107 |
| CA | 1473 |
| CO | 133 |
| CT | 98 |
| DC | 59 |
| DE | 32 |
| FL | 937 |
| GA | 521 |
| HI | 78 |
| IA | 21 |
| ID | 12 |
| IL | 501 |
| IN | 338 |
| KS | 73 |
| KY | 261 |
| LA | 372 |
| MA | 123 |
| MD | 255 |
| ME | 12 |
| MI | 505 |
| MN | 90 |
| MO | 328 |
| MS | 135 |
| MT | 2 |
| NC | 271 |
| ND | 6 |
| NE | 15 |
| NH | 9 |
| NJ | 193 |
| NM | 41 |
| NV | 134 |
| NY | 563 |
| OH | 302 |
| OK | 380 |
| OR | 90 |
| PA | 456 |
| PR | 65 |
| RI | 16 |
| SC | 129 |
| SD | 11 |
| TN | 230 |
| TX | 1968 |
| UT | 65 |
| VA | 180 |
| VI | 17 |
| VT | 9 |
| WA | 238 |
| WI | 80 |
| WV | 109 |
| TOTALS: 49 States, the District of Columbia, Puerto Rico and the Virgin Islands 65 | 12,539 Unique Users Affected |

65 AT&T reports that Wyoming (WY) was not impacted by this outage. This may be due to its small population, its low population density, or the low density of AT&T LTE cell sites in Wyoming.

APPENDIX D
List of Parties Filing Comments or Ex Parte Notices
PS Docket No. 17-68

Commenters
AT&T Services Inc.
Comtech Telecommunications Corp.

Ex Parte Filers
Association of Public-Safety Communications Officials (APCO) International
National Association of State 911 Administrators (NASNA)
Colorado Public Utilities Commission
City of New York Information Technology and Telecommunications
Arkansas Department of Emergency Management
Washington, D.C. Office of Unified Communications
California Office of Emergency Services, Emergency Communications Branch
County of Chester, Pennsylvania Department of Emergency Services
Minnesota Department of Public Safety, Emergency Communication Networks
Lincoln/Lancaster, Nebraska 911
North Carolina 911 Board
Texas Commission on State Emergency Communications
Iowa Homeland Security and Emergency Management
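
The elapsed-time column in Appendix B and the total in Appendix C can be re-derived from the figures given in those appendices. The following is a minimal Python sketch of that arithmetic, not part of the Bureau's analysis. It assumes the 15:52 CST outage start listed in Appendix B and that all events fall on the same day; the names `elapsed`, `OUTAGE_START`, and `STATE_COUNTS` are illustrative only.

```python
from datetime import datetime

OUTAGE_START = "15:52"  # outage start (CST) as listed in Appendix B


def elapsed(event_time, start=OUTAGE_START):
    """Return (hours, minutes) from start to event_time; both are same-day HH:MM strings."""
    fmt = "%H:%M"
    delta = datetime.strptime(event_time, fmt) - datetime.strptime(start, fmt)
    total_minutes = int(delta.total_seconds()) // 60
    return divmod(total_minutes, 60)


# Three rows taken from Appendix B: first alarm, first PSAP contact, rollback completed.
for clock in ("16:03", "19:26", "20:53"):
    hours, minutes = elapsed(clock)
    print(f"{clock} CST: {hours} hr, {minutes} min after outage start")

# Unique users impacted, copied from Appendix C.
STATE_COUNTS = {
    "AK": 43, "AL": 213, "AR": 240, "AZ": 107, "CA": 1473, "CO": 133,
    "CT": 98, "DC": 59, "DE": 32, "FL": 937, "GA": 521, "HI": 78,
    "IA": 21, "ID": 12, "IL": 501, "IN": 338, "KS": 73, "KY": 261,
    "LA": 372, "MA": 123, "MD": 255, "ME": 12, "MI": 505, "MN": 90,
    "MO": 328, "MS": 135, "MT": 2, "NC": 271, "ND": 6, "NE": 15,
    "NH": 9, "NJ": 193, "NM": 41, "NV": 134, "NY": 563, "OH": 302,
    "OK": 380, "OR": 90, "PA": 456, "PR": 65, "RI": 16, "SC": 129,
    "SD": 11, "TN": 230, "TX": 1968, "UT": 65, "VA": 180, "VI": 17,
    "VT": 9, "WA": 238, "WI": 80, "WV": 109,
}

total_users = sum(STATE_COUNTS.values())
print(len(STATE_COUNTS), "jurisdictions,", total_users, "unique users")
```

The 52 jurisdictions correspond to the 49 States, the District of Columbia, Puerto Rico and the Virgin Islands listed in Appendix C, and the per-state figures sum to the 12,539 total reported there.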