
The Reporting API is an emerging web standard that provides a generic reporting mechanism for issues occurring in the browsers visiting your production website. The reports you receive detail issues such as security violations or soon-to-be-deprecated APIs, as they occur in users’ browsers all over the world.

Collecting reports is often as simple as specifying an endpoint URL in an HTTP response header; the browser will then automatically start forwarding reports covering the issues you are interested in to that endpoint. However, processing and analyzing these reports is not that simple. For example, you may receive a massive number of reports on your endpoint, and it is possible that not all of them will be helpful in identifying the underlying problem. In such circumstances, distilling and fixing issues can be quite a challenge.

In this blog post, we'll share how the Google security team uses the Reporting API to detect potential issues and identify the actual problems causing them. We'll also introduce an open source solution, so you can easily replicate Google's approach to processing reports and acting on them.

How does the Reporting API work?

Some errors only occur in production, on users’ browsers to which you have no access. You won't see these errors locally or during development, because real users, real networks, and real devices create conditions you can't anticipate. With the Reporting API, you directly leverage the browser to monitor these errors: the browser catches these errors for you, generates an error report, and sends this report to an endpoint you've specified.

How reports are generated and sent.

Errors you can monitor with the Reporting API include security policy violations (for example, CSP or Document-Policy violations), calls to deprecated APIs, and browser crashes.

For a full list of error types you can monitor, see use cases and report types.

The Reporting API is activated and configured using HTTP response headers: you need to declare the endpoint(s) you want the browser to send reports to, and which error types you want to monitor. The browser then sends reports to your endpoint in POST requests whose payload is a list of reports.

Example setup:

# Example setup to receive CSP violation reports, Document-Policy violation reports, and Deprecation reports

Reporting-Endpoints: main-endpoint="https://reports.example/main", default="https://reports.example/default"

# CSP violations and Document-Policy violations will be sent to `main-endpoint`

Content-Security-Policy: script-src 'self'; object-src 'none'; report-to main-endpoint;

Document-Policy: document-write=?0; report-to=main-endpoint;

# Deprecation reports are generated automatically and don't need an explicit endpoint; they're always sent to the `default` endpoint

Note: Some policies support "report-only" mode. This means the policy sends a report, but doesn't actually enforce the restriction. This can help you gauge if the policy is working effectively.

Chrome users whose browsers generate reports can see them in DevTools in the Application panel:

Example of viewing reports in the Application panel of DevTools.

You can generate various violations and see how they are received on a server in the reporting endpoint demo:

Example violation reports

The Reporting API is supported by Chrome, and partially by Safari as of March 2024. For details, see the browser support table.

Google's approach

Google benefits from being able to uplift security at scale. Web platform mitigations like Content Security Policy, Trusted Types, Fetch Metadata, and the Cross-Origin Opener Policy help us engineer away entire classes of vulnerabilities across hundreds of Google products and thousands of individual services, as described in this blog post.

One of the engineering challenges of deploying security policies at scale is identifying code locations that are incompatible with new restrictions and that would break if those restrictions were enforced. There is a common 4-step process to solve this problem:

  1. Roll out policies in report-only mode (CSP report-only mode example). This instructs browsers to execute client-side code as usual, but gather information on any events where the policy would be violated if it were enforced. This information is packaged in violation reports that are sent to a reporting endpoint.
  2. The violation reports must be triaged to link them to locations in code that are incompatible with the policy. For example, some code bases may be incompatible with security policies because they use a dangerous API or use patterns that mix user data and code.
  3. The identified code locations are refactored to make them compatible, for example by using safe versions of dangerous APIs or changing the way user input is mixed with code. These refactorings uplift the security posture of the code base by helping reduce the usage of dangerous coding patterns.
  4. When all code locations have been identified and refactored, the policy can be removed from report-only mode and fully enforced. Note that in a typical roll out, we iterate steps 1 through 3 to ensure that we have triaged all violation reports.

With the Reporting API, we have the ability to run this cycle using a unified reporting endpoint and a single schema for several security features. This allows us to gather reports for a variety of features across different browsers, code paths, and types of users in a centralized way.

Note: A violation report is generated when an entity is attempting an action that one of your policies forbids. For example, you've set CSP on one of your pages, but the page is trying to load a script that's not allowed by your CSP. Most reports generated via the Reporting API are violation reports, but not all — other types include deprecation reports and crash reports. For details, see Use cases and report types.

Unfortunately, it is common for noise to creep into streams of violation reports, which can make finding incompatible code locations difficult. For example, many browser extensions, malware, antivirus software, and devtools users inject third-party code into the DOM or use forbidden APIs. If the injected code is incompatible with the policy, this can lead to violation reports that cannot be linked to our code base and are therefore not actionable. This makes triaging reports difficult and makes it hard to be confident that all code locations have been addressed before enforcing new policies.

Over the years, Google has developed a number of techniques to collect, digest, and summarize violation reports into root causes. Here is a summary of the most useful techniques we believe developers can use to filter out noise in reported violations:

Focus on root causes

It is often the case that a piece of code that is incompatible with the policy executes several times throughout the lifetime of a browser tab. Each time this happens, a new violation report is created and queued to be sent to the reporting endpoint. This can quickly lead to a large volume of individual reports, many of which contain redundant information. Because of this, grouping violation reports into clusters enables developers to abstract away individual violations and think in terms of root causes. Root causes are simpler to understand and can speed up the process of identifying useful refactorings.

Let's take a look at an example to understand how violations may be grouped. For instance, a report-only CSP that forbids the use of inline JavaScript event handlers is deployed. Violation reports are created on every instance of those handlers and have the following fields set:

  • The blockedURL field is set to inline, which describes the type of violation.
  • The scriptSample field is set to the first few bytes of the event handler’s contents.
  • The documentURL field is set to the URL of the current browser tab.

Most of the time, these three fields uniquely identify the inline handlers in a given URL, even if the values of other fields differ. This is common when there are tokens, timestamps, or other random values across page loads. Depending on your application or framework, the values of these fields can differ in subtle ways, so being able to do fuzzy matches on reporting values can go a long way in grouping violations into actionable clusters. In some cases, we can group violations whose URL fields have known prefixes; for example, all violations with URLs that start with chrome-extension, moz-extension, or safari-extension can be grouped together to separate root causes in browser extensions from those in our own codebase with a high degree of confidence.

Developing your own grouping strategies helps you stay focused on root causes and can significantly reduce the number of violation reports you need to triage. In general, it should always be possible to select fields that uniquely identify interesting types of violations and use those fields to prioritize the most important root causes.
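
To make this concrete, here is a minimal Python sketch of one possible grouping strategy, keyed on the three fields above, with browser-extension noise set aside by URL prefix. The report shape and the sourceFile field access are assumptions for illustration, not a fixed schema:

```python
from collections import Counter
from urllib.parse import urlsplit

EXTENSION_SCHEMES = ("chrome-extension", "moz-extension", "safari-extension")

def root_cause_key(report: dict):
    """Reduce a violation report to a cluster key identifying its root cause."""
    body = report.get("body", {})
    doc = urlsplit(body.get("documentURL", ""))
    if doc.scheme in EXTENSION_SCHEMES or \
       str(body.get("sourceFile", "")).startswith(EXTENSION_SCHEMES):
        return ("browser-extension-noise",)
    # Drop query strings and fragments, which often carry tokens, timestamps,
    # or other values that vary across page loads.
    stable_url = f"{doc.scheme}://{doc.netloc}{doc.path}"
    return (body.get("blockedURL"), body.get("scriptSample"), stable_url)

def cluster_reports(reports):
    """Count violations per root cause so the biggest clusters surface first."""
    return Counter(root_cause_key(r) for r in reports).most_common()
```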

Leverage ambient information

Another way of distinguishing non-actionable from actionable violation reports is ambient information. This is data that is contained in requests to our reporting endpoint, but that is not included in the violation reports themselves. Ambient information can hint at sources of noise in a client's set up that can help with triage:

  • User Agent or User Agent client hints: User agents are a great tell-tale sign of non-actionable violations. For example, crawlers, bots, and some mobile applications use custom user agents whose behavior differs from well-supported browser engines and that can trigger unique violations. In other cases, some violations may only trigger in a specific browser or be caused by changes in nightly builds or newer versions of browsers. Without user agent information, these violations would be significantly more difficult to investigate.
  • Trusted users: Browsers will attach any available cookies to requests made to a reporting endpoint by the Reporting API, if the endpoint is same-site with the document where the violation occurs. Capturing cookies is useful for identifying the type of user that caused a violation. Often, the most actionable violations come from trusted users that are not likely to have invasive extensions or malware, like company employees or website administrators. If you are not able to capture authentication information through your reporting endpoint, consider rolling out report-only policies to trusted users first. Doing so allows you to build a baseline of actionable violations before rolling out your policies to the general public.
  • Number of unique users: As a general principle, users of typical features or code paths should generate roughly the same violations. This allows us to flag violations seen by a small number of users as potentially suspicious, since they suggest that a user's particular setup might be at fault, rather than our application code. One way of 'counting users' is to keep note of the number of unique IP addresses that reported a violation. Approximate counting algorithms are simple to use and can help gather this information without tracking specific IP addresses. For example, the HyperLogLog algorithm requires just a few bytes to approximate the number of unique elements in a set with a high degree of confidence.
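
As an illustration of the last point, here is a small, self-contained HyperLogLog sketch in Python (a simplified textbook implementation, not production code) that approximates the number of unique reporting IPs using only a few kilobytes of register state:

```python
import hashlib
import math

class HyperLogLog:
    def __init__(self, p: int = 12):
        self.p = p                    # 2**p registers (~4 KB at p=12)
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, item: str):
        x = int.from_bytes(hashlib.sha256(item.encode()).digest()[:8], "big")
        idx = x >> (64 - self.p)      # first p bits choose a register
        rest = x & ((1 << (64 - self.p)) - 1)
        # Rank = position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1 if rest else (64 - self.p) + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def count(self) -> float:
        alpha = 0.7213 / (1 + 1.079 / self.m)
        estimate = alpha * self.m**2 / sum(2.0**-r for r in self.registers)
        zeros = self.registers.count(0)
        if estimate <= 2.5 * self.m and zeros:   # small-range correction
            estimate = self.m * math.log(self.m / zeros)
        return estimate

hll = HyperLogLog()
for ip in ("203.0.113.7", "198.51.100.23", "203.0.113.7"):
    hll.add(ip)                       # IPs are hashed, never stored
print(round(hll.count()))             # ≈ 2 unique reporters
```
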
Map violations to source code (advanced)

Some types of violations have a source_file field or equivalent. This field represents the JavaScript file that triggered the violation and is usually accompanied by a line and column number. These three bits of data are a high-quality signal that can point directly to lines of code that need to be refactored.

Nevertheless, it is often the case that source files fetched by browsers are compiled or minified and don't map directly to your code base. In this case, we recommend you use JavaScript source maps to map line and column numbers between deployed and authored files. This allows you to translate directly from violation reports to lines of source code, yielding highly actionable report groups and root causes.
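
A sketch of what that translation might look like in Python, assuming the third-party sourcemap package (its loads/lookup API is an assumption here, and reports typically use 1-based line/column numbers while source maps are 0-based):

```python
import sourcemap  # third-party package; API assumed for illustration

def map_violation_to_source(body: dict, sourcemap_text: str):
    """Translate a deployed-file location from a report to the authored file."""
    index = sourcemap.loads(sourcemap_text)
    token = index.lookup(line=body["lineNumber"] - 1,     # 1-based -> 0-based
                         column=body["columnNumber"] - 1)
    return token.src, token.src_line, token.src_col
```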

Establish your own solution

The Reporting API sends browser-side events, such as security violations, deprecated API calls, and browser interventions, to the specified endpoint on a per-event basis. However, as explained in the previous section, to distill the real issues out of those reports, you need a data processing system on your end.

Fortunately, there are plenty of options in the industry to set up the required architecture, including open source products. The fundamental pieces of the required system are the following:

  • API endpoint: A web server that accepts HTTP requests and handles reports in a JSON format
  • Storage: A storage server that stores received reports and reports processed by the pipeline
  • Data pipeline: A pipeline that filters out noise and extracts and aggregates required metadata into constellations
  • Data visualizer: A tool that provides insights on the processed reports

Solutions for each of the components listed above are made available by public cloud platforms, SaaS services, and as open source software. See the Alternative solutions section for details, and the following section outlining a sample application.
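
For example, the API endpoint piece can be as small as a few lines. The sketch below (Python with Flask; the /main path and store() helper are placeholders, not part of any product) accepts the POSTed JSON list of reports and acknowledges it:

```python
from flask import Flask, request

app = Flask(__name__)

@app.post("/main")  # must match a URL declared in your Reporting-Endpoints header
def collect_reports():
    # Browsers POST a JSON list of reports (Content-Type: application/reports+json).
    reports = request.get_json(force=True, silent=True) or []
    for report in reports:
        store(report)       # placeholder: hand off to your storage/pipeline
    return "", 204          # any 2xx response acknowledges receipt

def store(report: dict):
    print(report.get("type"), report.get("url"))

if __name__ == "__main__":
    app.run(port=8080)
```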

Sample application: Reporting API Processor

To help you understand how to receive reports from browsers and how to handle these received reports, we created a small sample application that demonstrates the following processes that are required for distilling web application security issues from reports sent by browsers:

  • Report ingestion to the storage
  • Noise reduction and data aggregation
  • Processed report data visualization

Although this sample relies on Google Cloud, you can replace each of the components with your preferred technologies. An overview of the sample application is illustrated in the following diagram:



The components shown as green boxes are the ones you need to implement yourself. Forwarder is a simple web server that receives reports in JSON format and converts them to the schema for Bigtable. Beam-collector is a simple Apache Beam pipeline that filters noisy reports, aggregates relevant reports into the shape of constellations, and saves them as CSV files. These two components are the key parts for making better use of reports from the Reporting API.

Try it yourself

Because this is a runnable sample application, you can deploy all components to a Google Cloud project and see how it works for yourself. The detailed prerequisites and the instructions to set up the sample system are documented in the README.md file.

Alternative solutions

Aside from the open source solution we shared, there are a number of tools available to assist in your usage of the Reporting API. Some of them include:

  • Report-collecting services like report-uri and uriports.
  • Application error monitoring platforms like Sentry, Datadog, etc.

Besides pricing, consider the following points when selecting alternatives:

  • Are you comfortable sharing any of your application's URLs with a third-party report collector? Even if the browser strips sensitive information from these URLs, some of it may still leak this way. If this sounds too risky for your application, operate your own reporting endpoint.
  • Does this collector support all report types you need? For example, not all reporting endpoint solutions support COOP/COEP violation reports.

Summary

In this article, we explained how web developers can collect client-side issues by using the Reporting API, and the challenges of distilling the real problems out of the collected reports. We also introduced how Google solves those challenges by filtering and processing reports, and shared an open source project that you can use to replicate a similar solution. We hope this information will motivate more developers to take advantage of the Reporting API and, in consequence, make their website more secure and sustainable.

Learning resources

Generative AI has emerged as a powerful and popular tool to automate content creation and simple tasks. From customized content creation to source code generation, it can increase both our productivity and creative potential.

Businesses want to leverage the power of LLMs, like Gemini, but many may have security concerns and want more control over how employees make use of these new tools. For example, companies may want to ensure that various forms of sensitive data, such as Personally Identifiable Information (PII), financial records, and internal intellectual property, are not shared publicly on Generative AI platforms. Security leaders face the challenge of finding the right balance: enabling employees to leverage AI to boost efficiency, while also safeguarding corporate data.

In this blog post, we'll explore reporting and enforcement policies that enterprise security teams can implement within Chrome Enterprise Premium for data loss prevention (DLP).

1. View login events* to understand usage of Generative AI services within the organization. With Chrome Enterprise's Reporting Connector, security and IT teams can see when a user successfully signs into a specific domain, including Generative AI websites. Security Operations teams can further leverage this telemetry to detect anomalies and threats by streaming the data into Chronicle or other third-party SIEMs at no additional cost.

2. Enable URL Filtering to warn users about sensitive data policies and let them decide whether or not they want to navigate to the URL, or to block users from navigating to certain groups of sites altogether.


For example, with Chrome Enterprise URL Filtering, IT admins can create rules that warn developers not to submit source code to specific Generative AI apps or tools, or block them.

3. Warn, block or monitor sensitive data actions within Generative AI websites with dynamic content-based rules for actions like paste, file uploads/downloads, and print. Chrome Enterprise DLP rules give IT admins granular control over browser activities, such as entering financial information in Gen AI websites. Admins can customize DLP rules to restrict the type and amount of data entered into these websites from managed browsers.

For most organizations, safely leveraging Generative AI requires a certain amount of control. As enterprises work through their policies and processes involving GenAI, Chrome Enterprise Premium empowers them to strike the balance that works best. Hear directly from security leaders at Snap on their use of DLP for Gen AI in this recording.

Learn more about how Chrome Enterprise can secure businesses just like yours here.

*Available at no additional cost in Chrome Enterprise Core

Keeping people safe and their data secure and private is a top priority for Android. That is why we took our time when designing the new Find My Device, which uses a crowdsourced device-locating network to help you find your lost or misplaced devices and belongings quickly – even when they’re offline. We gave careful consideration to the potential user security and privacy challenges that come with device finding services.

During development, it was important for us to ensure the new Find My Device was secure by default and private by design. To build a private, crowdsourced device-locating network, we first conducted user research and gathered feedback from privacy and advocacy groups. Next, we developed multi-layered protections across three main areas: data safeguards, safety-first protections, and user controls. This approach provides defense-in-depth for Find My Device users.

How location crowdsourcing works on the Find My Device network

The Find My Device network locates devices by harnessing the Bluetooth proximity of surrounding Android devices. Imagine you drop your keys at a cafe. The keys themselves have no location capabilities, but they may have a Bluetooth tag attached. Nearby Android devices participating in the Find My Device network report the location of the Bluetooth tag. When you realize you have lost your keys and log into the Find My Device mobile app, you will be able to see the aggregated location contributed by nearby Android devices and locate your keys.

Find My Device network protections


Let’s dive into key details of the multi-layered protections for the Find My Device network:

  • Data Safeguards: We’ve implemented protections that help ensure the privacy of everyone participating in the network and the crowdsourced location data that powers it.
    • Location data is end-to-end encrypted. When Android devices participating in the network report the location of a Bluetooth tag, the location is end-to-end encrypted using a key that is only accessible to the Bluetooth tag owner and anyone the owner has shared the tag with in the Find My Device app. Only the Bluetooth tag owner (and those they’ve chosen to share access with) can decrypt and view the tag’s location. With end-to-end encrypted location data, Google cannot decrypt, see, or otherwise use the location data.
    • Private, crowdsourced location reports. These end-to-end encrypted locations are contributed to the Find My Device network in a manner that does not allow Google to identify the owners of the nearby Android devices that provided the location data. And when the Find My Device network shows the location and timestamp to the Bluetooth tag’s owner to help them find their belongings, no other information about the nearby Android devices that contributed the data is included.
    • Minimizing network data. End-to-end encrypted location data is minimally buffered and frequently overwritten. In addition, if the network can help find a Bluetooth tag using the owner’s nearby devices (e.g., if their own phone detects the tag), the network will discard crowdsourced reports for the tag.
  • Safety-first Protections: The Find My Device network protects against risks such as use of an unknown Bluetooth tag to stalk or identify another user, including:
    • Aggregation by default. This is a first-of-its-kind safety protection that makes unwanted tracking to a private location, like your home, more difficult. By default, the Find My Device network requires multiple nearby Android devices to detect a tag before reporting its location to the tag's owner. Our research found that the Find My Device network is most valuable in public settings like cafes and airports, where there are likely many devices nearby. By implementing aggregation before showing a tag’s location to its owner, the network can take advantage of its biggest strength – over a billion Android devices that can participate. This helps tag owners find their lost devices in these busier locations while prioritizing safety from unwanted tracking near private locations. In less busy areas, last known location and Nest finding are reliable ways to locate items.
    • At home protection. If a user has chosen to save their home address in their Google Account, their Android device will also ensure that it does not contribute crowdsourced location reports to the Find My Device network when it is near the user’s home. This provides additional protection on top of aggregation by default against unwanted tracking near private locations.
    • Rate limiting and throttling. The Find My Device network limits the number of times that a nearby Android device can contribute a location report for a particular Bluetooth tag. The network also throttles how frequently the owner of a Bluetooth tag can request an updated location for the tag. We've found that lost items are typically left behind in stationary spots. For example, you lose your keys at the cafe, and they stay at the table where you had your morning coffee. Meanwhile, a malicious user is often trying to engage in real-time tracking of a person. By applying rate limiting and throttling to reduce how often the location of a device is updated, the network continues to be helpful for finding items, like your lost checked baggage on a trip, while helping mitigate the risk of real-time tracking.
    • Unknown tracker alerts. The Find My Device network is also compliant with the integration version of the joint industry standard for unwanted tracking. Being compliant with the integration version of the standard means that both Android and iOS users will receive unknown tracker alerts if the on-device algorithm detects that someone may be using a Find My Device network-compatible tag to track them without their knowledge, proactively alerting the user through a notification on their phone.
  • User Controls: Android users always have full control over which of their devices participate in the Find My Device network and how those devices participate. Users can either stick with the default and contribute to aggregated location reporting, opt into contributing non-aggregated locations, or turn the network off altogether. Find My Device also provides the ability to secure or erase data from a lost device.

In addition to careful security architectural design, the new Find My Device network has undergone internal Android red team testing. The Find My Device network has also been added to the Android security vulnerability rewards program to take advantage of Android’s global ecosystem of security researchers. We’re also engaging with select researchers through our private grant program to encourage more targeted research.

Prioritizing user safety on Find My Device

Together, these multi-layered user protections help mitigate potential risks to user privacy and safety while allowing users to effectively locate and recover lost devices.

As bad actors continue to look for new ways to exploit users, our work to help keep users safe on Android is never over. We have an unwavering commitment to continue to improve user protections on Find My Device and prioritize user safety.


For more information about Find My Device on Android, please visit our help center. You can read the Find My Device Network Accessory specification here.



The Domain Name System (DNS) is a fundamental protocol used on the Internet to translate human-readable domain names (e.g., www.example.com) into numeric IP addresses (e.g., 192.0.2.1) so that devices and servers can find and communicate with each other. When a user enters a domain name in their browser, the DNS resolver (e.g. Google Public DNS) locates the authoritative DNS nameservers for the requested name, and queries one or more of them to obtain the IP address(es) to return to the browser.

When DNS was launched in the early 1980s as a trusted, content-neutral infrastructure, security was not yet a pressing concern. As the Internet grew, however, DNS became vulnerable to various attacks. In this post, we will look at DNS cache poisoning attacks and how Google Public DNS addresses the risks associated with them.

DNS Cache Poisoning Attacks

DNS lookups in most applications are forwarded to a caching resolver (which could be local or an open resolver like Google Public DNS). The path from a client to the resolver is usually on a local network or can be protected using encrypted transports like DoH or DoT. The resolver queries authoritative DNS servers to obtain answers for user queries. This communication primarily occurs over UDP, an insecure connectionless protocol in which messages, including the source IP address, can easily be spoofed. The content of DNS queries may be sufficiently predictable that even an off-path attacker can, with enough effort, forge responses that appear to be from the queried authoritative server. This response will be cached if it matches the necessary fields and arrives before the authentic response. This type of attack is called a cache poisoning attack, and it can cause great harm once successful. According to RFC 5452, the probability of success is very high without protection. Forged DNS responses can lead to denial of service, or may even compromise application security. For an excellent introduction to cache poisoning attacks, please see “An Illustrated Guide to the Kaminsky DNS Vulnerability”.

Cache poisoning mitigations in Google Public DNS

Improving DNS security has been a goal of Google Public DNS since our launch in 2009. We take a multi-pronged approach to protect users against DNS cache-poisoning attacks. No single countermeasure entirely solves the problem, but in combination these measures make successful attacks substantially more difficult.


RFC 5452 And DNS Cookies

We have implemented the basic countermeasures outlined in RFC 5452, namely randomizing query source ports and query IDs. But these measures alone are not sufficient (see page 8 of our OARC 38 presentation).
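
As a back-of-the-envelope illustration of what this randomization buys (the exact usable port range varies by system, so these are rough numbers):

```python
import math

query_ids = 2 ** 16      # 16-bit DNS transaction ID
source_ports = 2 ** 16   # approximate; the usable ephemeral range is smaller
combinations = query_ids * source_ports

# An off-path attacker must match both fields in a forged response.
print(f"1 in {combinations:,} (~{math.log2(combinations):.0f} bits) per packet")
```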

We have therefore also implemented support for RFC 7873 (DNS Cookies), which can make spoofing impractical if it’s supported by the authoritative server. However, measurements indicate that DNS Cookies do not provide sufficient coverage: even though around 40% of nameservers by IP support DNS Cookies, they account for less than 10% of overall query volume. In addition, many non-compliant nameservers return incorrect or ambiguous responses for queries with DNS Cookies, which creates further deployment obstacles. For now, we’ve enabled DNS Cookies through manual configuration, primarily for selected TLD zones.

Case Randomization (0x20)

The query name case randomization mechanism, originally proposed in the March 2008 draft “Use of Bit 0x20 in DNS Labels to Improve Transaction Identity”, is highly effective, because all but a small minority of nameservers are compatible with it. Since 2009 we have been randomizing the case of query names sent to a small set of chosen nameservers that handle only a minority of our query volume.

In 2022 we started work on enabling case randomization by default. When it is used, the query name in the question section of the request is randomized, and the DNS server’s response is expected to match the case-randomized query name exactly. For example, if “ExaMplE.CoM” is the name sent in the request, the name in the question section of the response must also be “ExaMplE.CoM” rather than, e.g., “example.com”. Responses that fail to preserve the case of the query name may be dropped as potential cache poisoning attacks (and retried over TCP).
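
A minimal Python sketch of the idea (illustrative only; a real resolver randomizes and compares the wire-format question name):

```python
import secrets

def randomize_case(qname: str) -> str:
    # Randomly flip the case (the 0x20 bit) of every ASCII letter in the name.
    return "".join(
        c.upper() if c.isalpha() and secrets.randbits(1) else c.lower()
        for c in qname
    )

def response_case_ok(sent_qname: str, response_qname: str) -> bool:
    # The response must echo the query name byte for byte, case included.
    return sent_qname == response_qname

sent = randomize_case("example.com")   # e.g. "ExaMplE.CoM"
# A response whose question-section name fails response_case_ok(sent, ...)
# may be dropped as a potential poisoning attempt and retried over TCP.
```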

We are happy to announce that we’ve already enabled and deployed this feature globally by default. It covers over 90% of our UDP traffic to nameservers, significantly reducing the risk of cache poisoning attacks.

Meanwhile, we maintain an exception list and implement fallback mechanisms to prevent potential issues with non-conformant nameservers. However, we strongly recommend that nameserver implementations preserve the query case in the response.

DNS-over-TLS

In addition to case randomization, we’ve deployed DNS-over-TLS to authoritative nameservers (ADoT), following procedures described in RFC 9539 (Unilateral Opportunistic Deployment of Encrypted Recursive-to-Authoritative DNS). Real-world measurements show that ADoT has a higher success rate than, and comparable latency to, UDP. ADoT is currently in use for around 6% of egress traffic. At the cost of some CPU and memory, we get both security and privacy for nameserver queries without DNS compliance issues.

Summary

Google Public DNS takes security of our users seriously. Through multiple countermeasures to cache poisoning attacks, we aim to provide a more secure and reliable DNS resolution service, enhancing the overall Internet experience for users worldwide. With the measures described above we are able to provide protection against passive attacks for over 90% of authoritative queries.


To enhance DNS security, we recommend that DNS server operators support one or more of the security mechanisms described here. We are also working with the DNS community to improve DNS security. Please see our presentations at DNS-OARC 38 and 40 for more technical details.

With steady improvements to Android userspace and kernel security, we have noticed increasing interest from security researchers directed towards lower-level firmware. This area has traditionally received less scrutiny, but is critical to device security. We have previously discussed how we have been prioritizing firmware security, and how to apply mitigations in a firmware environment to protect against unknown vulnerabilities.

In this post we will show how the Kernel Address Sanitizer (KASan) can be used to proactively discover vulnerabilities earlier in the development lifecycle. Despite the narrow application implied by its name, KASan is applicable to a wide range of firmware targets. Using KASan-enabled builds during testing and/or fuzzing can help catch memory corruption vulnerabilities and stability issues before they land on user devices. We've already used KASan in some firmware targets to proactively find and fix 40+ memory safety bugs and vulnerabilities, including some of critical severity.

Along with this blog post we are releasing a small project which demonstrates an implementation of KASan for bare-metal targets leveraging the QEMU system emulator. Readers can refer to this implementation for technical details while following the blog post.

Address Sanitizer (ASan) overview

Address Sanitizer is a compiler-based instrumentation tool used to identify invalid memory access operations during runtime. It is capable of detecting the following classes of temporal and spatial memory safety bugs:

  • out-of-bounds memory access
  • use-after-free
  • double/invalid free
  • use-after-return

ASan relies on the compiler to instrument code with dynamic checks for virtual addresses used in load/store operations. A separate runtime library defines the instrumentation hooks for the heap memory and error reporting. For most user-space targets (such as aarch64-linux-android) ASan can be enabled as simply as using the -fsanitize=address compiler option for Clang due to existing support of this target both in the toolchain and in the libclang_rt runtime.

However, the situation is rather different for bare-metal code, which is frequently built for “none” system targets such as arm-none-eabi. Unlike traditional user-space programs, bare-metal code running inside an embedded system often doesn’t have a common runtime implementation. As such, LLVM can’t provide a default runtime for these environments.

To provide custom implementations for the necessary runtime routines, the Clang toolchain exposes an interface for address sanitization through the -fsanitize=kernel-address compiler option. The KASan runtime routines implemented in the Linux kernel serve as a great example of how to define a KASan runtime for targets which aren’t supported by default with -fsanitize=address. We'll demonstrate how to use the version of address sanitizer originally built for the kernel on other bare-metal targets.

KASan 101

Let’s take a look at the KASan major building blocks from a high-level perspective (a thorough explanation of how ASan works under-the-hood is provided in this whitepaper).

The main idea behind KASan is that every memory access operation, such as load/store instructions and memory copy functions (for example, memmove and memcpy), is instrumented with code that verifies the destination/source memory regions. KASan only allows memory access operations that use valid memory regions. When KASan detects memory access to an invalid memory region (that is, memory that has already been freed, or an out-of-bounds access), it reports this violation to the system.

The state of memory regions covered by KASan is maintained in a dedicated area called shadow memory. Every byte in the shadow memory corresponds to a single fixed-size memory region covered by KASan (typically 8 bytes) and encodes its state: whether the corresponding memory region has been allocated or freed, and how many bytes in the memory region are accessible.

Therefore, to enable KASan for a bare-metal target we would need to implement the instrumentation routines which verify validity of memory regions in memory access operations and report KASan violations to the system. In addition we would also need to implement shadow memory management to track the state of memory regions which we want to be covered with KASan.

Enabling KASan for bare-metal firmware

KASan shadow memory

The very first step in enabling KASan for firmware is to reserve a sufficient amount of DRAM for shadow memory. This is a memory region where each byte is used by KASan to track the state of an 8-byte region. This means accommodating the shadow memory requires a dedicated memory region equal to 1/8th the size of the address space covered by KASan.

KASan maps every 8-byte aligned address from the DRAM region into the shadow memory using the following formula:

shadow_address = (target_address >> 3) + shadow_memory_base

where target_address is the address of an 8-byte memory region which we want to cover with KASan and shadow_memory_base is the base address of the shadow memory area.
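
For example, using the mapping offset from the build-options example later in this post (0x42700000), the arithmetic works out as follows (a quick Python check):

```python
SHADOW_MEMORY_BASE = 0x42700000   # example offset reused from later in this post

def shadow_address(target_address: int) -> int:
    # One shadow byte tracks each 8-byte-aligned granule of covered DRAM.
    return (target_address >> 3) + SHADOW_MEMORY_BASE

# Covering 0x40000000-0x4fffffff places shadow memory starting at 0x4A700000.
assert shadow_address(0x40000000) == 0x4A700000
assert shadow_address(0x4FFFFFF8) == 0x4C6FFFFF
```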

Implement a KASan runtime

Once we have the shadow memory tracking the state of every single 8-byte memory region of DRAM we need to implement the necessary runtime routines which KASan instrumentation depends on. For reference, a comprehensive list of runtime routines needed for KASan can be found in the linux/mm/kasan/kasan.h Linux kernel header. However, it might not be necessary to implement all of them and in the following text we focus on the ones which were needed to enable KASan for our target firmware as an example.

Memory access check

The routines __asan_loadXX_noabort and __asan_storeXX_noabort perform verification of memory accesses at runtime. The suffix XX denotes the size of the memory access in bytes and is a power of 2 from 1 up to 16. The toolchain instruments every memory load and store operation with these functions so that they are invoked before the memory access operation happens. These routines take as input a pointer to the target memory region to check it against the shadow memory.

If the region state provided by shadow memory doesn’t reveal a violation, then these functions return to the caller. But if any violations (for example, the memory region is accessed after it has been deallocated or there is an out-of-bounds access) are revealed, then these functions report the KASan violation by:

  • Generating a call-stack.
  • Capturing context around the memory regions.
  • Logging the error.
  • Aborting/crashing the system (optional).

Shadow memory management

The routine __asan_set_shadow_YY is used to poison shadow memory for a given address. This routine is used by the toolchain instrumentation to update the state of memory regions. For example, the KASan runtime uses this function to poison the red zones around stack variables in a function’s prologue and mark that memory accessible again in its epilogue.

This routine takes as input a target memory address and sets the corresponding byte in shadow memory to the value of YY. Here are some example YY values for shadow memory that encode the state of 8-byte memory regions:

  • 0x00 -- the entire 8-byte region is accessible
  • 0x01-0x07 -- only the first bytes in the memory region are accessible
  • 0xf1 -- not accessible: stack left red zone
  • 0xf2 -- not accessible: stack mid red zone
  • 0xf3 -- not accessible: stack right red zone
  • 0xfa -- not accessible: globals red zone
  • 0xff -- not accessible
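
To make these encodings concrete, here is a toy Python model of shadow memory (purely illustrative; the real runtime is compiled code operating on raw memory) showing how a __asan_set_shadow_YY-style routine and a __asan_loadXX-style check interact:

```python
ACCESSIBLE, POISONED = 0x00, 0xFF

shadow = {}  # granule index (address >> 3) -> shadow byte

def set_shadow(addr: int, size: int, value: int):
    # Toy counterpart of __asan_set_shadow_YY: write `value` over the
    # shadow bytes covering [addr, addr + size).
    for granule in range(addr >> 3, (addr + size + 7) >> 3):
        shadow[granule] = value

def check_access(addr: int, size: int):
    # Toy counterpart of __asan_loadXX/__asan_storeXX: every byte touched
    # must fall within the accessible prefix of its 8-byte granule.
    for a in range(addr, addr + size):
        s = shadow.get(a >> 3, POISONED)   # unknown memory counts as poisoned
        if s == ACCESSIBLE:
            continue
        if s >= 0x80 or (a & 7) >= s:      # red zone/freed, or past valid prefix
            raise MemoryError(f"KASan violation: {size}-byte access at {hex(a)}")

set_shadow(0x40001000, 16, ACCESSIBLE)     # pretend a 16-byte buffer is live
check_access(0x40001008, 8)                # in bounds: passes silently
try:
    check_access(0x40001010, 8)            # one granule past the end
except MemoryError as e:
    print(e)                               # reports the violation
```
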

Covering global variables

The routines __asan_register_globals and __asan_unregister_globals are used to poison/unpoison memory for global variables. The KASan runtime calls these functions while processing global constructors/destructors. For instance, the routine __asan_register_globals is invoked for every global variable. It takes as an argument a pointer to a data structure which describes the target global variable: the structure provides the starting address of the variable, its size not including the red zone, and its size including the red zone.

The red zone is extra padding the compiler inserts after the variable to increase the likelihood of detecting an out-of-bounds memory access. Red zones ensure there is extra space between adjacent global variables. It is the responsibility of __asan_register_globals routine to mark the corresponding shadow memory as accessible for the variable and as poisoned for the red zone.

As readers can infer from its name, the routine __asan_unregister_globals is invoked while processing global destructors and is intended to poison shadow memory for the target global variable. As a result, any memory access to such a global will cause a KASan violation.

Memory copy functions

The KASan compiler instrumentation routines __asan_loadXX_noabort and __asan_storeXX_noabort discussed above are used to verify individual memory load and store operations, such as reading or writing an array element or dereferencing a pointer. However, these routines don't cover memory access in bulk-memory copy functions such as memcpy, memmove, and memset. In many cases these functions are provided by the runtime library or implemented in assembly to optimize for performance.

Therefore, in order to be able to catch invalid memory access in these functions, we would need to provide sanitized versions of memcpy, memmove, and memset functions in our KASan implementation which would verify memory buffers to be valid memory regions.

Avoiding false positives for noreturn functions

Another routine required by KASan is __asan_handle_no_return, to perform cleanup before a noreturn function and avoid false positives on the stack. KASan adds red zones around stack variables at the start of each function, and removes them at the end. If a function does not return normally (for example, in case of longjmp-like functions and exception handling), red zones must be removed explicitly with __asan_handle_no_return.

Hook heap memory allocation routines

Bare-metal code in the vast majority of cases provides its own heap implementation. It is our responsibility to implement an instrumented version of heap memory allocation and freeing routines which enable KASan to detect memory corruption bugs on the heap.

Essentially, we would need to instrument the memory allocator with code which unpoisons the KASan shadow memory corresponding to the allocated memory buffer. Additionally, we may want to insert an extra poisoned red zone (any access to which would generate a KASan violation) at the end of the allocated buffer to increase the likelihood of catching out-of-bounds memory reads/writes.

Similarly, in the memory deallocation routine (such as free) we would need to poison the shadow memory corresponding to the freed buffer so that any subsequent access (such as a use-after-free) would generate a KASan violation.

We can go even further by placing the freed memory buffer into a quarantine instead of immediately returning the free memory back to the allocator. This way, the freed memory buffer is suspended in quarantine for some time and will have its KASan shadow bytes poisoned for a longer period of time, increasing the probability of catching a use-after-free access to this buffer.
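
A sketch of this quarantine idea in Python (the capacity and the poison/real_free callbacks are placeholders for whatever your allocator provides):

```python
from collections import deque

QUARANTINE_SLOTS = 64        # placeholder capacity; tune for your heap size
_quarantine = deque()

def kasan_free(addr: int, size: int, poison, real_free):
    # Poison the buffer's shadow bytes right away, so any use-after-free
    # access trips a KASan violation...
    poison(addr, size)
    # ...but delay returning the memory to the allocator, extending the
    # window in which a stale pointer dereference can still be caught.
    _quarantine.append((addr, size))
    if len(_quarantine) > QUARANTINE_SLOTS:
        old_addr, _old_size = _quarantine.popleft()
        real_free(old_addr)  # only now is the oldest buffer truly freed
```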

Enable KASan for heap, stack and global variables

With all the necessary building blocks implemented we are ready to enable KASan for our bare-metal code by applying the following compiler options while building the target with the LLVM toolchain.

The -fsanitize=kernel-address Clang option instructs the compiler to instrument memory load/store operations with the KASan verification routines.

We use the -asan-mapping-offset LLVM option to indicate where we want our shadow memory to be located. For instance, let’s assume that we would like to cover the address range 0x40000000 - 0x4fffffff and we want to keep shadow memory at address 0x4A700000. So, we would use -mllvm -asan-mapping-offset=0x42700000, as (0x40000000 >> 3) + 0x42700000 == 0x4A700000.

To cover globals and stack variables with KASan we would need to pass additional options to the compiler: -mllvm -asan-stack=1 -mllvm -asan-globals=1. It’s worth mentioning that instrumenting both globals and stack variables will likely result in an increase in size of the corresponding memory which might need to be accounted for in the linker script.

Finally, to prevent a significant increase in the size of the code section due to KASan instrumentation, we instruct the compiler to always outline KASan checks using the -mllvm -asan-instrumentation-with-call-threshold=0 option. Otherwise, the compiler might inline the __asan_loadXX_noabort and __asan_storeXX_noabort routines for load/store operations, bloating the generated object code.

LLVM has traditionally only supported sanitizers for specific targets with predefined runtimes. However, we have upstreamed LLVM sanitizer support for bare-metal targets under the assumption that the runtime can be defined for the particular target. You’ll need the latest version of Clang to benefit from this.

Conclusion

Following these steps we managed to enable KASan for a firmware target and use it in pre-production test builds. This led to early discovery of memory corruption issues that were easily remediated due to the actionable reports produced by KASan. These builds can be used with fuzzers to detect edge case bugs that normal testing fails to trigger, yet which can have significant security implications.

Our work with KASan is just one example of the multiple techniques the Android team is exploring to further secure bare-metal firmware in the Android Platform. Ideally we want to avoid introducing memory safety vulnerabilities in the first place so we are working to address this problem through adoption of memory-safe Rust in bare-metal environments. The Android team has developed Rust training which covers bare-metal Rust extensively. We highly encourage others to explore Rust (or other memory-safe languages) as an alternative to C/C++ in their firmware.

If you have any questions, please reach out – we’re here to help!

Acknowledgements: Thank you to Roger Piqueras Jover for contributions to this post, and to Evgenii Stepanov for upstreaming LLVM support for bare-metal sanitizers. Special thanks also to our colleagues who contribute and support our firmware security efforts: Sami Tolvanen, Stephan Somogyi, Stephan Chen, Dominik Maier, Xuan Xing, Farzan Karimi, Pirama Arumuga Nainar, Stephen Hines.

For more than 15 years, Google Safe Browsing has been protecting users from phishing, malware, unwanted software and more, by identifying and warning users about potentially abusive sites on more than 5 billion devices around the world. As attackers grow more sophisticated, we've seen the need for protections that can adapt as quickly as the threats they defend against. That’s why we're excited to announce a new version of Safe Browsing that will provide real-time, privacy-preserving URL protection for people using the Standard protection mode of Safe Browsing in Chrome.

Current landscape

Chrome automatically protects you by flagging potentially dangerous sites and files, hand in hand with Safe Browsing, which discovers thousands of unsafe sites every day and adds them to its lists of harmful sites and files.

So far, for privacy and performance reasons, Chrome has first checked sites you visit against a locally-stored list of known unsafe sites which is updated every 30 to 60 minutes – this is done using hash-based checks.


Hash-based check overview

But unsafe sites have adapted — today, the majority of them exist for less than 10 minutes, meaning that by the time the locally-stored list of known unsafe sites is updated, many have slipped through and had the chance to do damage if users happened to visit them during this window of opportunity. Further, Safe Browsing’s list of harmful websites continues to grow at a rapid pace. Not all devices have the resources necessary to maintain this growing list, nor are they always able to receive and apply updates to the list at the frequency necessary to benefit from full protection.

Safe Browsing’s Enhanced protection mode already stays ahead of such threats with technologies such as real-time list checks and AI-based classification of malicious URLs and web pages. We built this mode as an opt-in to give users the choice of sharing more security-related data in order to get stronger security. This mode has shown that checking lists in real time brings significant value, so we decided to bring that to the default Standard protection mode through a new API – one that doesn't share the URLs of sites you visit with Google.

Introducing real-time, privacy-preserving Safe Browsing

How it works

In order to transition to real-time protection, checks now need to be performed against a list that is maintained on the Safe Browsing server. The server-side list can include unsafe sites as soon as they are discovered, so it is able to capture sites that switch quickly. It can also grow as large as needed because the Safe Browsing server is not constrained in the same way that user devices are.

Behind the scenes, here's what is happening in Chrome:

  1. When you visit a site, Chrome first checks its cache to see if the address (URL) of the site is already known to be safe (see the “Staying speedy and reliable” section for details).
  2. If the visited URL is not in the cache, it may be unsafe, so a real-time check is necessary.
  3. Chrome obfuscates the URL by following the URL hashing guidance to convert the URL into 32-byte full hashes.
  4. Chrome truncates the full hashes into 4-byte long hash prefixes.
  5. Chrome encrypts the hash prefixes and sends them to a privacy server (see the “Keeping your data private” section for details).
  6. The privacy server removes potential user identifiers and forwards the encrypted hash prefixes to the Safe Browsing server via a TLS connection that mixes requests with many other Chrome users.
  7. The Safe Browsing server decrypts the hash prefixes and matches them against the server-side database, returning full hashes of all unsafe URLs that match one of the hash prefixes sent by Chrome.
  8. After receiving the unsafe full hashes, Chrome checks them against the full hashes of the visited URL.
  9. If any match is found, Chrome will show a warning.
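
Steps 3, 4, and 8 can be sketched in a few lines of Python. The URL expressions below are illustrative; the URL hashing guidance defines the exact canonicalization, and SHA-256 yields the 32-byte full hashes:

```python
import hashlib

def full_hashes(url_expressions):
    # Step 3: each canonicalized URL expression becomes a 32-byte full hash.
    return [hashlib.sha256(expr.encode()).digest() for expr in url_expressions]

def hash_prefixes(hashes, length=4):
    # Step 4: only these truncated prefixes are encrypted and sent upstream;
    # the URL itself never leaves the device.
    return [h[:length] for h in hashes]

def any_unsafe(local_full_hashes, unsafe_full_hashes):
    # Step 8: compare the server's unsafe full hashes against the local ones.
    return any(h in unsafe_full_hashes for h in local_full_hashes)

# Illustrative expressions for https://example.com/path (not the exact set
# the canonicalization rules would produce):
hashes = full_hashes(["example.com/path", "example.com/"])
prefixes = hash_prefixes(hashes)
```
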
Keeping your data private

In order to preserve user privacy, we have partnered with Fastly, an edge cloud platform that provides content delivery, edge compute, security, and observability services, to operate an Oblivious HTTP (OHTTP) privacy server between Chrome and Safe Browsing – you can learn more about Fastly's commitment to user privacy on their Customer Trust page. With OHTTP, Safe Browsing does not see your IP address, and your Safe Browsing checks are mixed amongst those sent by other Chrome users. This means Safe Browsing cannot correlate the URL checks you send as you browse the web.

Before hash prefixes leave your device, Chrome encrypts them using a public key from Safe Browsing. These encrypted hash prefixes are then sent to the privacy server. Since the privacy server doesn’t know the private key, it cannot decrypt the hash prefixes, which offers privacy from the privacy server itself.

The privacy server then removes potential user identifiers such as your IP address and forwards the encrypted hash prefixes to the Safe Browsing server. The privacy server is operated independently by Fastly, meaning that Google doesn’t have access to potential user identifiers (including IP address and User Agent) from the original request. Once the Safe Browsing server receives the encrypted hash prefixes from the privacy server, it decrypts the hash prefixes with its private key and then continues to check the server-side list.

Ultimately, Safe Browsing sees the hash prefixes of your URL but not your IP address, and the privacy server sees your IP address but not the hash prefixes. No single party has access to both your identity and the hash prefixes. As such, your browsing activity remains private.

Real-time check overview

Staying speedy and reliable

Compared with the hash-based check, the real-time check requires sending a request to a server, which adds additional latency. We have employed a few techniques to make sure your browsing experience continues to be smooth and responsive.

First, before performing the real-time check, Chrome checks against a global and local cache on your device to avoid unnecessary delay.

  • The global cache is a list of hashes of known-safe URLs that is served by Safe Browsing. Chrome fetches it in the background. If any full hash of the URL is found in the global cache, Chrome will consider it less risky and perform a hash-based check instead.
  • The local cache, on the other hand, is a list of full hashes that are saved from previous Safe Browsing checks. If there is a match in the local cache, and the cache has not yet expired, Chrome will not send a real-time request to the Safe Browsing server.

Both caches are stored in memory, so it is much faster to check them than sending a real-time request over the network.
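
Put together, the decision of whether a real-time request is needed might look like this sketch (the names and cache shapes are assumptions for illustration, not Chrome's implementation):

```python
import time

def realtime_check_needed(url_full_hashes, global_cache, local_cache):
    # Global cache: full hashes of known-safe URLs, fetched in the background.
    if any(h in global_cache for h in url_full_hashes):
        return False   # considered lower risk: do a hash-based check instead
    # Local cache: verdicts kept from previous Safe Browsing checks.
    now = time.time()
    for h in url_full_hashes:
        entry = local_cache.get(h)
        if entry is not None and entry["expires_at"] > now:
            return False   # fresh cached verdict: no network request needed
    return True            # cache miss: perform the real-time check
```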

In addition, Chrome follows a fallback mechanism in case of unsuccessful or slow requests. If consecutive real-time requests fail, Chrome will enter a back-off mode and downgrade the checks to hash-based checks for a certain period.

We are also in the process of introducing an asynchronous mechanism, which will allow the site to load while the real-time check is in progress. This will improve the user experience, as the real-time check won’t block page load.

What real-time, privacy-preserving URL protection means for you

Chrome users

With the latest release of Chrome for desktop, Android, and iOS, we’re upgrading the Standard protection mode of Safe Browsing so it will now check sites using Safe Browsing’s real-time protection protocol, without sharing your browsing history with Google. You don't need to take any action to benefit from this improved functionality.

If you want more protection, we still encourage you to turn on the Enhanced protection mode of Safe Browsing. You might wonder why you need enhanced protection when you'll be getting real-time URL protection in Standard protection – this is because in Standard protection mode, the real-time feature can only protect you from sites that Safe Browsing has already confirmed to be unsafe. On the other hand, Enhanced protection mode is able to use additional information together with advanced machine learning models to protect you from sites that Safe Browsing may not yet have confirmed to be unsafe, for example because the site was only very recently created or is cloaking its true behavior to Safe Browsing’s detection systems.

Enhanced protection also continues to offer protection beyond real-time URL checks, for example by providing deep scans for suspicious files and extra protection from suspicious Chrome extensions.

Enterprises

The real-time feature of the Standard protection mode of Safe Browsing is on by default for Chrome. If needed, it may be configured using the policy SafeBrowsingProxiedRealTimeChecksAllowed. It is also worth noting that in order for this feature to work in Chrome, enterprises may need to explicitly allow traffic to the Fastly privacy server. If the server is not reachable, Chrome will downgrade the checks to hash-based checks.

Developers

While Chrome is the first surface where these protections are available, we plan to make them available to eligible developers for non-commercial use cases via the Safe Browsing API. Using the API, developers and privacy server operators can partner to better protect their products’ users from fast-moving malicious actors in a privacy-preserving manner. To learn more, keep an eye out for our upcoming developer documentation to be published on the Google for Developers site.