We love this one: https://themarkup.org/blacklight

Blacklight: A Real-Time Website Privacy Inspector
By Surya Mattu

Who is peeking over your shoulder while you work, watch videos, learn, explore, and shop on the internet? Enter the address of any website, and Blacklight will scan it and reveal the specific user-tracking technologies on the site, and who’s getting your data. You may be surprised at what you learn.

Read about what they learned running this tool against different websites (https://themarkup.org/series/blacklight) and how they built it:

Blacklight is a real-time website privacy inspector. The tool emulates how a user might be surveilled while browsing the web. Users type a URL into Blacklight, and it visits the requested website, scans for known types of privacy violations, and returns an instant privacy analysis of the inspected site.

Blacklight works by visiting each website with a headless browser, running custom software built by The Markup. This software monitors which scripts on that website are potentially surveilling the user by performing seven different tests, each investigating a specific, known method of surveillance. The types of surveillance that Blacklight seeks to identify are:
Blacklight was built using the Node.js JavaScript environment and the Puppeteer Node library, which provides high-level control over a Chromium (open-source Chrome) browser. When a user enters a URL into Blacklight, the tool opens a headless web browser with a fresh profile and visits the website’s homepage as well as an additional randomly selected page deeper inside the same website.

While the browser is visiting the website, it runs custom software in the background that monitors scripts and network requests to observe when and how user data is being collected. To monitor scripts, Blacklight modifies various fingerprintable properties of the browser’s Window API. This allows Blacklight to log which script made a particular function call, using the Stacktrace-js package. The network requests are collected using a monitoring tool included in Puppeteer’s API.

Blacklight uses the script data and network requests to run the seven tests described above. Afterward, it closes the browser and generates an instant report for the user.

Blacklight records a list of all the URLs that the inspected website requests. In addition, it makes a list of all domains and subdomains that were requested. The tool we provide to the public will not save those lists unless the user chooses to share results with us through an option in the tool.

We define domain names using the Public Suffix + 1 method. We define a first-party domain as any domain that matches the website visited, including subdomains. We define a third-party domain as any domain that does not match the website visited. The tool compares the list of third-party domains from the website requests with DuckDuckGo’s Tracker Radar dataset.
This data merge allows Blacklight to add the following information about the third-party domains found on the inspected site:
Blacklight runs tests based on the root URL of the page entered by a user into the tool. For example, if a user types in https://example.com/sports, Blacklight starts its inspection at https://example.com and disregards the /sports path. If a user types in https://sports.example.com, Blacklight starts its inspection at https://sports.example.com.

Blacklight results for each requested domain are cached for 24 hours, and these cached reports are delivered in response to subsequent user requests for the same website during those 24 hours. This is designed to prevent the tool from being used maliciously to overwhelm a website with thousands of automated visits.

Blacklight will also tell users whether their results are high, low, or about average compared with what the tool found on the 100,000 most popular websites as ranked by the Tranco List. This is described in more detail below.

The Blacklight code base is open source and available on GitHub; it can also be downloaded as an npm module.

There are limitations to our analysis. Blacklight emulates a user visiting a website, but its automated behavior is different from human behavior, and that behavior may trigger different types of surveillance. For instance, an automated request might trigger more fraud detection but fewer ads.

Given the dynamic nature of web-based technology, it is also possible that some of these tests will become out-of-date over time. And new legitimate-use cases for the techniques Blacklight flags could emerge that would not be listed in the tool’s caveats.

For this reason, Blacklight results should not be taken as the final word on potential privacy violations by a given website. Rather, they should be treated as an initial automated inspection that requires further investigation before a definitive claim can be made.
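The root-URL rule and the first-party and third-party definitions described so far can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Blacklight's own code (which runs on Node.js): the function names are hypothetical, and ASSUMED_PUBLIC_SUFFIXES is a tiny stand-in for the full Public Suffix List that a real tool would load.

```python
from urllib.parse import urlsplit

# Assumption: a tiny stand-in for the real Public Suffix List.
ASSUMED_PUBLIC_SUFFIXES = {"com", "org", "net", "co.uk"}


def root_url(url: str) -> str:
    """Keep only scheme and host, dropping path, query, and fragment."""
    parts = urlsplit(url)
    return f"{parts.scheme}://{parts.netloc}"


def registrable_domain(hostname: str) -> str:
    """Return the public suffix plus one label ("Public Suffix + 1")."""
    labels = hostname.lower().rstrip(".").split(".")
    # Scan from the longest candidate suffix to the shortest,
    # so that "co.uk" is matched before "uk" would be.
    for i in range(len(labels)):
        if ".".join(labels[i:]) in ASSUMED_PUBLIC_SUFFIXES:
            if i == 0:
                return ".".join(labels)  # the hostname is itself a suffix
            return ".".join(labels[i - 1:])
    # Suffix not in our stand-in list: fall back to the last two labels.
    return ".".join(labels[-2:])


def is_third_party(request_url: str, site_url: str) -> bool:
    """A request is third-party when its registrable domain differs
    from that of the visited site; subdomains count as first-party."""
    request_host = urlsplit(request_url).hostname or ""
    site_host = urlsplit(site_url).hostname or ""
    return registrable_domain(request_host) != registrable_domain(site_host)


if __name__ == "__main__":
    print(root_url("https://example.com/sports"))
    print(is_third_party("https://anon-stats.eff.org/piwik.js", "https://eff.org"))
    print(is_third_party("https://stats.g.doubleclick.net/r/collect", "https://eff.org"))
```

Comparing registrable domains rather than full hostnames is what lets a site's own subdomains count as first-party while any other domain counts as third-party.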
Previous Work

Blacklight is built on the foundation of various privacy census tools built over the past decade.

It runs JavaScript instrumentation, which enables it to monitor calls to the browser’s JavaScript API. This is based on OpenWPM, an open-source tool for web privacy measurement built by Steven Englehardt, Gunes Acar, Dillon Reisman, and Arvind Narayanan at Princeton University. It is now maintained by Mozilla.

OpenWPM was used to power Princeton’s Web Transparency and Accountability Project, which monitored websites and services to discover companies’ data collection, data use, and deceptive practices.

Through numerous studies conducted between 2015 and 2019, Princeton researchers uncovered the presence of many privacy-infringing technologies. These included browser fingerprinting and cookie syncing as well as how session replay scripts collect passwords and sensitive user data. One notable example is the exfiltration of prescription data and health-conditions data from walgreens.com.

Four of the seven tests Blacklight runs are based on the techniques described in the Princeton research mentioned above. These tests are canvas fingerprinting, key logging, session recording, and third-party cookies.

OpenWPM incorporates code and techniques from other privacy inspection tools, including FourthParty, Privacy Badger, and FP Detective:
Other projects that have influenced Blacklight’s development include the Web Privacy Census, conducted at UC Berkeley in 2012, and the Wall Street Journal’s “What They Know” series.

How We Analyzed Each Type of Tracking

Third-Party Cookies

Third-party cookies are small pieces of data that tracking companies store in your web browser when you visit a website. This bit of text (usually a unique number or string of characters) identifies you when you visit other websites that contain tracking code from the same company. Third-party cookies are used by hundreds of companies to build dossiers about users and deliver customized ads based on their behavior.

Popular web browsers Edge, Brave, Firefox, and Safari all block third-party tracking cookies by default, and Chrome has announced that it will phase them out.

What Blacklight Tests

Blacklight monitors network requests for the “Set-Cookie” header and observes all domains that set cookies using the document.cookie JavaScript property. Blacklight identifies third-party cookies as those whose domains do not match the domain of the website being visited. We look up these third-party domains in DuckDuckGo’s Tracker Radar data to find out who owns them, how prevalent they are, and what kinds of services they provide.

Key Logging

Key logging is when a first or third party monitors the text that you type into a webpage before you hit the submit button. This technique has been used for a variety of purposes, including identifying anonymous web users by matching them to postal addresses and real names.

There are other reasons for key logging, such as providing autocomplete functionality. Blacklight cannot determine the intent behind the inspected website’s use of this technique.

What Blacklight Tests

In order to test whether this is happening on a given website, Blacklight types predetermined text (see Appendix) in all input fields but never clicks on a submit button.
It monitors network requests to see if the data that was entered was sent to any servers.

Session Recording

Session recording is technology that allows a third party to monitor and record all of a user’s behavior on a webpage, including mouse movements, clicks, scrolling down the page, and anything you type into a form even if you don’t click submit.

In a 2017 study, researchers at Princeton University found that session recorders were collecting sensitive information such as passwords and credit card numbers. When the researchers contacted the companies in question, most responded quickly and fixed the underlying cause of the data leak. However, the research highlights that these aren’t simply bugs but rather insecure practices that the researchers say should be stopped entirely.

Most companies that offer session recording say they use the data to provide their customers (the websites installing the technology) meaningful insights on how to improve a user’s experience on the website. One company, Inspectlet, describes its service as watching “individual visitors use your site as if you’re looking over their shoulders.” (Inspectlet did not respond to an email seeking comment.)

[Figure: Screenshot from Inspectlet, a known session recording provider. Credit: Inspectlet]

What Blacklight Tests

We define session recording as the loading of a specific type of script by a company that we know to be providing session recording services. Blacklight monitors the network requests for specific URL substrings that appear only when session recording is taking place, according to a list created by researchers at Princeton University in 2017.

Sometimes key logging is used as part of session recording.
In those cases, Blacklight would correctly report the session recorder as both key logging and session recording because we observed both, even though both tests are identifying the same script.

Blacklight accurately detects when a website loads these scripts, but companies typically record only a sample of website visits, so not every user is being recorded on every visit.

Canvas Fingerprinting

Fingerprinting describes a group of techniques that try to identify your browser without setting a cookie. They can identify you even if you block all cookies.

Canvas fingerprinting is a type of fingerprinting that identifies users by drawing shapes and text on a user’s webpage and noting the minor differences in the way they are rendered. These differences in font rendering, smoothing, anti-aliasing, and other features are used by marketers and others to identify individual devices. All of the major internet browsers, except Chrome, try to counter canvas fingerprinting, either by not fulfilling data requests for scripts known to have engaged in the practice or by trying to standardize users’ fingerprints.

The image below is an example of the type of canvas images used by fingerprinting scripts. These canvases are usually invisible to the user.

[Figure: Four examples of canvas fingerprinting found with Blacklight.]

What Blacklight Tests

We follow the methodology described in a paper by researchers at Princeton University to identify when the HTML canvas element is used for tracking purposes. The parameters we use to identify canvases that are being drawn for fingerprinting purposes are:
Ad Trackers

Ad trackers are technologies that identify and collect information about users. These technologies usually (but not always) appear with some level of consent from the website owners. They are used to collect website user analytics, for ad targeting, and by data brokers and other information collectors to build user profiles. They usually take the form of JavaScript scripts or web beacons.

Web beacons are small 1px by 1px images that are placed on a website for tracking purposes by third parties. Using this technique, a third party can determine behaviors including when a particular user went to a site, the kind of browser, and what IP address it used.

What Blacklight Tests

Blacklight checks all network requests against the EasyPrivacy list, which contains URLs and URL substrings that are known to be used for tracking. Blacklight monitors the network activity for requests being made to these URLs and substrings.

Blacklight only records requests being made to third-party domains. It ignores any URL patterns in the EasyPrivacy list that match a first-party domain. For example, the EFF hosts its own analytics, and that results in requests to “https://anon-stats.eff.org,” their analytics subdomain. If a user types in https://eff.org, Blacklight does not consider calls to https://anon-stats.eff.org to be a third-party request.

We look up these third-party domains in DuckDuckGo’s Tracker Radar data set to find out who owns them, how prevalent they are, and what kinds of services they provide. We only include third-party domains that belong to the “Ad Motivated Tracking” categories defined in the Tracker Radar data set.

Facebook Pixel

The Facebook pixel is a piece of code Facebook created that allows other websites to target their visitors later with ads on Facebook. Common actions that can be tracked by the pixel include viewing a page or specific content, adding payment information, or making a purchase.
What Blacklight Tests

Blacklight looks for network requests from the site going to Facebook and looks in the URL query parameters for data that matches the schema of what is described in the documentation for Facebook’s pixel. We look for three different types of data: “standard events,” “custom events,” and “advanced matching.”

Google Analytics’ “Remarketing Audiences”

Google Analytics is the most popular website analytics platform in use today. According to whotracks.me, 41.7 percent of web traffic is analyzed by Google Analytics. While most of the functionality of this service is to provide developers and website owners with information on how their audience is engaging with their website, the tool also allows the website to make custom audience lists based on user behavior and then target ads to those visitors across the internet using Google Ads and Display & Video 360. Blacklight examines inspected sites for the presence of the tool, not how it is used.

What Blacklight Tests

Blacklight looks for network requests from the inspected site going to a URL beginning with “stats.g.doubleclick” that also contains the “UA-” Google account identifier prefix. This is described in more detail in Google Analytics developer documentation.

Survey

To determine the prevalence of tracking technologies on the internet both for context in Blacklight and for accompanying news stories, we ran the 100,000 most popular websites as defined by the Tranco List through Blacklight. The data and analysis code can be found on GitHub.

Blacklight successfully captured data for 81,617 of those URLs. The rest either failed to resolve, timed out on multiple attempts, or didn’t load a webpage. The percentages listed below are for the 81,617 successful captures.

Some of the analysis goes beyond what appears in the tool. The key findings from our survey are as follows:
Limitations

Blacklight’s analysis is limited by four main factors:
Regarding false positives, when Blacklight visits a site, that site can see the request is coming from computers hosted by Amazon’s AWS cloud infrastructure. Because botnets are often run on cloud infrastructure, our tool could trigger bot-detection software on the website, including canvas fingerprinting. This could result in false positives for the canvas fingerprinting test where the purpose of the test is not to track users but rather to detect botnets.

In order to test this, we took a random sample of 1,000 sites from the top websites from the Tranco List that we had already run through Blacklight on AWS. We ran this sample through Blacklight software on our computer locally at a residential IP address in New York City. We concluded that the results of a Blacklight inspection run locally are very similar, but not identical, to those from running it on cloud infrastructure.

Results for Sample: Local Computer and AWS

                                                  Local   AWS
Canvas fingerprinting                             8%      10%
Session recording                                 18%     19%
Key logging                                       4%      6%
Median number of third-party cookies              4       5
Median number of third-party trackers recorded    7       8

Not all surveillance activity that is imperceptible to the user is necessarily malicious. For instance, canvas fingerprinting is used for fraud prevention because it can identify a device. And key logging can be used to provide autocomplete functionality.

Blacklight does not attempt to identify the intent of any particular tracking technology it finds. Nor can Blacklight determine exactly how a website uses the data it collects on a user when loading session recording scripts and monitoring user behavior, such as mouse movements and keystrokes.

Blacklight does not check the terms of use or privacy policies of the websites it visits to see whether they disclose their surveillance activities.

Appendix

Input field values

The table below lists the values we programmed Blacklight to type into input fields on websites. We used the Mozilla autocomplete attribute write-up as our reference.
Blacklight also checks for the base64, md5, sha256 and sha512 versions of these values.

Autocomplete Attribute    Blacklight Value
Date                      01/01/2026
Email                     [email protected]
Password                  SUPERS3CR3T_PASSWORD
Search                    TheMarkup
Text                      IdaaaaTarbell
URL                       https://themarkup.org
Organization              The Markup
Organization Title        Non-profit newsroom
Current Password          S3CR3T_CURRENT_PASSWORD
New Password              S3CR3T_NEW_PASSWORD
Username                  idaaaa_tarbell
Family Name               Tarbell
Given Name                Idaaaa
Name                      IdaaaaTarbell
Street Address            PO Box #1103
Address Line 1            PO Box #1103
Postal Code               10159
CC-Name                   IDAAAATARBELL
CC-Given-Name             IDAAAA
CC-Family-Name            TARBELL
CC-Number                 4479846060020724
CC-Exp                    01/2026
CC-Type                   Visa
Transaction Amount        13371337

Acknowledgements

We thank Gunes Acar (KU Leuven), Steven Englehardt (Mozilla), and Arvind Narayanan and Jonathan Mayer (Princeton, CITP) for comments and suggestions on an earlier draft.
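The key-logging check described in the appendix (looking for typed values in plain, base64, md5, sha256, and sha512 form) can be illustrated with a short Python sketch. It is a hedged approximation rather than Blacklight's actual code: the function names are hypothetical, and it assumes UTF-8 input, lowercase hexadecimal digests, and a simple substring search over a request's URL or body.

```python
import base64
import hashlib


def encoded_variants(value: str) -> dict[str, str]:
    """Return the plain value plus the encodings named in the appendix.

    Assumption: hashes appear as lowercase hex digests of the UTF-8 bytes.
    """
    raw = value.encode("utf-8")
    return {
        "plain": value,
        "base64": base64.b64encode(raw).decode("ascii"),
        "md5": hashlib.md5(raw).hexdigest(),
        "sha256": hashlib.sha256(raw).hexdigest(),
        "sha512": hashlib.sha512(raw).hexdigest(),
    }


def find_leaks(typed_values: dict[str, str], request_payload: str) -> list[tuple[str, str]]:
    """List (field, encoding) pairs whose encoded value occurs in the payload.

    `request_payload` stands for whatever text was captured from one
    network request, such as its URL or POST body.
    """
    leaks = []
    for field, value in typed_values.items():
        for encoding, needle in encoded_variants(value).items():
            if needle in request_payload:
                leaks.append((field, encoding))
    return leaks


if __name__ == "__main__":
    typed = {"password": "SUPERS3CR3T_PASSWORD", "search": "TheMarkup"}
    # Hypothetical request that carries a hashed copy of the typed password.
    payload = "https://tracker.example/collect?h=" + encoded_variants(
        "SUPERS3CR3T_PASSWORD"
    )["sha256"]
    print(find_leaks(typed, payload))
```

Checking hashed and encoded forms matters because a script can transform typed text before sending it, so a search for the plain value alone would miss those requests.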