Analyzing Accessibility of Wikipedia Projects Around the World

This study, conducted by the Internet Monitor project at the Berkman Klein Center for Internet & Society, analyzes the scope of government-sponsored censorship of Wikimedia sites around the world. The study finds that, as of June 2016, China was likely censoring the Chinese language Wikipedia project, and Thailand and Uzbekistan were likely interfering intermittently with specific language projects of Wikipedia as well. However, considering the widespread use of filtering technologies and the vast coverage of Wikipedia, our study finds that, as of June 2016, there was relatively little censorship of Wikipedia globally. In fact, our study finds there was less censorship in June 2016 than before Wikipedia’s transition to HTTPS-only content delivery in June 2015. HTTPS prevents censors from seeing which page a user is viewing, which means censors must choose between blocking the entire site and allowing access to all articles. This finding suggests that the shift to HTTPS has been a good one in terms of ensuring accessibility to knowledge. The study identifies and documents the blocking of Wikipedia content using two complementary data collection and analysis strategies: a client-side system that collects data from the perspective of users around the globe and a server-side tool to analyze traffic coming in to Wikipedia servers. Both client- and server-side methods detected events that we consider likely related to censorship, in addition to a large number of suspicious events that remain unexplained. The report features results of our data analysis and insights into the state of access to Wikipedia content in 15 select countries.


Introduction
As one of the largest online repositories of user-generated content in the world, covering topics that range from the general reference 3 to the highly controversial, 4 Wikipedia has repeatedly found itself the target of government censors in countries ranging from China to Iran to Uzbekistan. In some cases, individual articles have been singled out: Turkey has blocked a handful of articles related to reproductive biology, as well as at least one political article; 5 in 2008, a number of ISPs in the United Kingdom blocked access to an article about the German band Scorpions' album, "Virgin Killer," the album art for which was a provocative image of a naked child. 6 In other cases, one or two offending articles have prompted wholesale blocks of the site: Russia has intermittently blocked access to all of Wikipedia out of concerns around articles related to the smoking of marijuana; 7 and in 2006, Pakistan temporarily blocked the site in response to an article on "Draw Mohammed Day," which violated certain religious prohibitions against visual depictions of Mohammed. 8 Syria, 9 China, 10 Iran, 11 Tunisia, 12 and Uzbekistan 13 have all blacklisted the site at various times without publicly citing specific content concerns.
A detailed look at the filtering of specific Wikipedia articles can serve as a window into the kinds of content (political, historical, religious, sexual, cultural, drug- or alcohol-related) that trigger censorship in different countries. Censorship of Wikipedia became slightly more complex, however, [...]
Both client- and server-side methods detected events that we consider likely related to censorship, in addition to a large number of suspicious events that remain unexplained. The blocking of Chinese Wikipedia in China starting in May 2015 was identified in the server-side article data, the server-side project data, and the client-side data. We identified a number of articles that appeared to be censored on Persian Wikipedia prior to the transition to solely HTTPS. Our client-side analysis witnessed transitory but intentional blocking of Yiddish Wikipedia in Thailand, as well as an unconfirmed but highly suspicious inability to access Uzbek Wikipedia from Uzbekistan. This latter event correlated with a highly anomalous decrease in traffic from Uzbekistan to Uzbek Wikipedia apparent in the server-side data. Article analysis uncovered a suspicious decrease in historical traffic to Vietnamese articles related to sex and sexuality. Analysis of project-level data uncovered a number of significant decreases in traffic from various countries that correlated with in-country events. These events ranged from natural disasters to political upheaval and affected access not only to Wikipedia but to the Internet more broadly.

Methods
While this study has the simply stated goal of analyzing the accessibility of Wikipedia around the world, the methods required are more complex. We broke down the problem into three separate questions: where is Wikipedia blocked, how is Wikipedia blocked, and why is Wikipedia blocked.
To assess where Wikipedia is currently blocked, we used two methods. One looked at the levels of traffic to Wikimedia's servers, and one made requests for the various Wikipedia projects from vantage points around the world. We refer to these two methods respectively as "server-side" and "client-side" analysis throughout the report.
Our server-side data analysis consisted of running an anomaly detection algorithm 21 on the daily number of requests from every country to each of Wikipedia's 292 language projects. 22 This data was available from May 2015 through June 2016, and we were given access to this data on Wikimedia's servers under a non-disclosure agreement. When run against this data, the anomaly detection algorithm output an "anomalousness" score for each day's number of requests, where a negative score meant fewer requests than expected and a positive score meant more requests than expected. The resulting anomalies were then filtered to only the most negative anomalies. Graphs of these anomalous events were generated and then manually reviewed for patterns that might indicate possible censorship events. For cases in which we were interested in specific countries, we generated graphs regardless of the automatically detected anomalies and manually reviewed these.
21 This algorithm consists mainly of Robust Principal Component Analysis and is described in more detail in Appendix E.
22 For a list of the projects, see Appendix A.
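The scoring and filtering of daily request counts can be illustrated with a minimal Python sketch. It is not the Robust Principal Component Analysis procedure referenced in Appendix E; it swaps in a much simpler rolling median and median-absolute-deviation score purely to show the sign convention (negative means fewer requests than expected) and the filtering to the most negative anomalies. The function names, the 28-day window, and the threshold are assumptions made for illustration.

```python
from statistics import median

def anomaly_scores(daily_requests, window=28):
    """Score each day against the median of the preceding `window` days.

    Negative scores mean fewer requests than expected, positive scores more.
    Days without a full window of history get a score of 0.0.
    """
    scores = []
    for i, value in enumerate(daily_requests):
        history = daily_requests[max(0, i - window):i]
        if len(history) < window:
            scores.append(0.0)
            continue
        center = median(history)
        # Median absolute deviation: a spread estimate that outliers barely move.
        mad = median(abs(x - center) for x in history)
        scale = 1.4826 * mad if mad > 0 else 1.0
        scores.append((value - center) / scale)
    return scores

def most_negative(scores, threshold=-5.0):
    """Return (day_index, score) pairs at or below threshold, most negative first."""
    hits = [(i, s) for i, s in enumerate(scores) if s <= threshold]
    return sorted(hits, key=lambda pair: pair[1])
```

In this sketch, the days returned by `most_negative` would be the ones plotted and passed on to manual review.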
Our client-side analysis consisted of performing repeated requests to each of Wikipedia's projects from 41 network vantage points located in 40 countries. These countries were chosen because they made up the entirety of our testing network as of June 2016. 23 From each of our test locations, we requested domains of the pattern "http://(project code).wikipedia.org/wiki/," where "(project code)" is the code given by Wikimedia to each of Wikipedia's various language projects (e.g., "http://en.wikipedia.org/wiki/" for English Wikipedia). It is important to note that because we did not have access at the time of testing to in-country DNS servers, all DNS resolution took place using Google's public DNS servers (8.8.8.8 and 8.8.4.4). This means we were unable to detect any manipulation of requests for Wikipedia that took place only at the DNS level. 24 For each request we performed, we collected the time it took for the request to complete, the final URL of the response after we followed all redirects, and a screenshot of the resulting page as would be seen by the user. For any request that failed on the initial attempt, we repeated the request until we either received a successful response or it was deemed the domain was likely unavailable from the vantage point. Once all the responses were collected, we reviewed the collected data for any irregularities that might indicate blocking or throttling.
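A minimal Python sketch of the request loop described above follows. It only covers building the per-project URL, timing the request, recording the final URL after redirects, and retrying failures; screenshot capture and the choice of DNS resolver are outside its scope. The function names, the 30-second timeout, and the limit of three attempts are assumptions, since the report does not state how many retries were made before a domain was deemed unavailable.

```python
import time
import urllib.request

def project_url(project_code):
    """Build the test URL for one language project, e.g. 'en' for English."""
    return "http://{}.wikipedia.org/wiki/".format(project_code)

def fetch(url, timeout=30):
    """Request url, following redirects; return (final_url, seconds_elapsed)."""
    start = time.monotonic()
    with urllib.request.urlopen(url, timeout=timeout) as response:
        response.read()  # wait for the complete body
        final_url = response.geturl()
    return final_url, time.monotonic() - start

def probe(project_code, fetcher=fetch, max_attempts=3):
    """Try a project up to max_attempts times and record the outcome."""
    url = project_url(project_code)
    for attempt in range(1, max_attempts + 1):
        try:
            final_url, elapsed = fetcher(url)
        except OSError:
            continue  # urllib.error.URLError and socket timeouts subclass OSError
        return {"url": url, "ok": True, "final_url": final_url,
                "seconds": elapsed, "attempts": attempt}
    return {"url": url, "ok": False, "final_url": None,
            "seconds": None, "attempts": max_attempts}
```

The `fetcher` parameter exists so the retry logic can be exercised without network access; a real deployment would call `probe` once per project code from each vantage point.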
Originally, to answer how Wikipedia might be censored, we intended to use full packet captures of our client tests to identify the precise technological method used to interfere with requests. For example, packet captures could be used to discriminate between IP blocking, injected TCP reset packets, DNS poisoning, injected HTTP redirects, TLS certificate spoofing, or other methods. It is also sometimes possible to identify the use of specific censorship products by looking for distinctive traits they might leave in packet captures. 25 26 Unfortunately, technical limitations in the deployment of our client network prevented us from collecting these packet captures. Therefore, in witnessed cases of blocking, we could do little but speculate as to the exact technological method of censorship.
Apart from (honest) statements of governments and ISPs, the best way we have to learn about why censors block what they do is to look at historical actions for clues to their motivations. To that end, we used two methods to build context around censorship events that might help us understand motivations. First, we performed traditional research to identify and summarize key themes in the history of censorship in several countries around the world. Second, we attempted to use traffic data to specific Wikipedia articles to locate historical instances of potential censorship with the hope that these historical instances would surface themes and help bolster existing research.
23 The full list of the countries in which our test nodes were located is provided in Appendix B.
24 Further detail of our client-side collection and its potential drawbacks is provided in Appendix B.
25 "Behind Blue Coat: Investigations of commercial filtering in Syria and Burma," Nov 9, 2011, Citizen Lab, https://citizenlab.org/2011/11/behind-blue-coat/.

Our method of detecting potential censorship of articles using traffic data was fairly intuitive. We started with the hypothesis that if an article has an amount of traffic such that the number of requests per some chosen period of time is rarely zero, and then that article is censored for a sizable portion of its audience, traffic to that article will likely decrease a detectable amount. For example, if an article typically sees around 100 requests per day, and it suddenly drops to 10 requests per day for a week, we can assume something has changed. That change event would then be investigated to identify potential causes. To search for such events, we built an anomaly detection pipeline that could automatically detect significant deviations from the normal pattern of requests. 27 We then detected anomalies in the daily request histories from December 2011 through late April 2016 for approximately 1.7 million articles. Our method of selecting this set of articles was designed to favor articles that we considered more likely to be censored. The final set of 1.7 million articles covered 286 distinct Wikipedia language projects (out of the total 292), 132 of which were represented by more than 10,000 articles. All of the detected anomalies were collected in a database that allowed for easy searching.
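The underlying hypothesis (an article that normally sees about 100 requests per day suddenly sitting near 10 for a week) can be sketched in Python as a search for sustained drops below a recent baseline. This is a simplified stand-in for the pipeline described above, not a reproduction of it; the function name, the 30-day baseline, the 50% drop ratio, and the three-day minimum run are assumptions chosen for illustration.

```python
from statistics import median

def sustained_drops(daily_requests, baseline_days=30, drop_ratio=0.5, min_run=3):
    """Find runs of at least min_run consecutive days whose request count
    falls below drop_ratio times the median of the preceding baseline_days.

    Returns a list of (start_index, end_index) pairs, end inclusive.
    The baseline is frozen at the start of a run so the drop itself
    does not drag the expected level down.
    """
    runs = []
    run_start = None
    baseline = None
    for i in range(baseline_days, len(daily_requests)):
        if run_start is None:
            baseline = median(daily_requests[i - baseline_days:i])
        low = daily_requests[i] < drop_ratio * baseline
        if low and run_start is None:
            run_start = i
        elif not low and run_start is not None:
            if i - run_start >= min_run:
                runs.append((run_start, i - 1))
            run_start = None
    # Close a run that is still open when the series ends.
    if run_start is not None and len(daily_requests) - run_start >= min_run:
        runs.append((run_start, len(daily_requests) - 1))
    return runs
```

For a series of thirty days at 100 requests, seven days at 10, and five days back at 100, `sustained_drops` returns `[(30, 36)]`, the week-long drop. Each such run would still need manual investigation, since the sketch cannot distinguish censorship from holidays, outages, or renamed articles.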
It is important to note that daily requests to articles were not broken out by geography. Instead, each data point represented the number of requests in a day for a given article from everywhere on Earth. This meant that we could not definitively attribute any given article anomaly to requests from a particular country. Instead, we could only assume the anomaly was most likely related to the country that constituted the largest share of requests to the article's language project. For example, if we located an anomaly in the request history of an article on Persian Wikipedia, the fact that 83.5% of requests for Persian Wikipedia come from Iran gave us some confidence that the anomaly could be related to Iran. On the other hand, if we located an anomaly in an article on English Wikipedia, we felt that we could not claim the anomaly was related to any single country, as nine countries each contribute more than one percent of the total requests for English Wikipedia. 28 While request data broken out by both article and geography existed, only a small amount of this data was relevant to our analysis, and we therefore opted to use a different data source. 29 This was an unfortunate loss of some of the power of interpretability that we had hoped to achieve with our methodology.
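The attribution rule described above can be written as a short Python sketch: an anomaly is tentatively tied to a country only when that country dominates requests to the article's language project. The function name and the 50% cutoff are assumptions; the report does not state a numeric threshold, and the multi-country shares used in any example are hypothetical placeholders rather than measured values.

```python
def attribute_to_country(request_shares, min_share=0.5):
    """Return the country with the largest share of a project's requests,
    or None when no country reaches min_share.

    request_shares maps country name to its fraction of total requests.
    """
    if not request_shares:
        return None
    country, share = max(request_shares.items(), key=lambda item: item[1])
    return country if share >= min_share else None
```

With Iran at 83.5% of Persian Wikipedia requests, `attribute_to_country({"Iran": 0.835, "elsewhere": 0.165})` returns `"Iran"`, whereas a project whose traffic is spread across many countries yields `None` and the anomaly is left unattributed.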
Once we had run all of the article histories through our pipeline, we set about manually reviewing and investigating the anomalies that represented the most severe and longest lasting decreases in request traffic. This manual review phase was both necessary and slow. We found it necessary because large decreases in traffic can be caused by many different processes (national holidays, network outages, articles moving or being redirected, bot activity, etc.), so determining whether or not an anomaly is likely a censorship event is an evidence-building process. Unfortunately, the large volume of detected anomalies and the fact that our data analysis process included a good deal of manual review meant we were not able to investigate all the significant anomalies individually. A full accounting of our article-level analysis methodology and the issues we encountered while implementing it is provided in Appendix E.
27 This is the same as the algorithm used for Wikipedia project-level analysis. For more information about the process and the algorithm (Robust Principal Component Analysis), see Appendix E.
We use three kinds of graphs throughout this report. In the simplest case, we show the number of daily requests for a single article over time. In these graphs, there are vertical colored bars to indicate detected anomalies. Blue vertical bars indicate fewer requests than expected while red bars indicate more requests than expected. The depth of the hue roughly indicates the anomalousness of each anomaly relative to the other anomalies for the same article. Anomaly color bars are also included in project-level graphs where we felt they accurately highlighted important points and are excluded where they hindered interpretation. These project-level graphs do not contain numbers on their vertical axis because the data backing these graphs is only publicly available at a less granular level. Numbers are also omitted on the vertical axis of graphs that depict multiple articles at once, but for a different reason. To account for the varying levels of traffic between articles, the vertical axis depicts the percent change in traffic since the start of the graph period. This effectively normalizes the number of requests across articles, and the axis is indicated as such. Anomaly color bars are omitted on multi-article graphs as they tended to hinder interpretability.
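The normalization used for multi-article graphs can be sketched in a few lines of Python. The sketch assumes that "the start of the graph period" means the first day's request count; the report does not say whether a single day or an average over several days was used, and the function name is hypothetical.

```python
def percent_change_since_start(daily_requests):
    """Express each day's requests as percent change from the first day."""
    start = daily_requests[0]
    if start == 0:
        raise ValueError("first day has zero requests; percent change is undefined")
    return [100.0 * (value - start) / start for value in daily_requests]
```

Two articles with very different traffic levels, such as `[200, 100, 300]` and `[20, 10, 30]`, both map to `[0.0, -50.0, 50.0]`, which is what allows them to share one vertical axis.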

Findings By Country
Country boundaries are not mirrored in the network topology of the Internet with much fidelity, but the censorship decisions with the broadest impact are often made at the national level, so we believe the state is a useful level of assessment. Below, we have highlighted a number of countries. These countries were chosen because they have either reportedly blocked Wikipedia content at some point in the past or because we have evidence of past or present broader Internet censorship within the country. For each country, we provide a short summary of the history and current state of local Internet filtering. Following that, we include the country-specific results of our data analysis that were the most noteworthy. These results may include tests from client locations, analysis of project-level data, or analysis of article-level data.

China
China's Internet filtering apparatus is one of the most pervasive and complex in the world. Freedom House has expressed strong concerns about China's Internet freedoms, noting that the country uses a wide variety of techniques (IP blocking, throttling, man-in-the-middle attacks, deep packet inspection, DNS poisoning, keyword filtering, content removal, SMS and instant message filtering, the blocking of VPNs, and full Internet shutdowns in some areas) to block political and sexually explicit content, globally popular social media and publishing platforms, and Google and many of its [...] More recent research has suggested that criticism of the ruling party is largely tolerated while content that has the potential to spur real-world collective action is of primary concern to censors. 33
Chinese censors have a long and contentious history with Wikipedia. The first Chinese-language Wikipedia project, chinese.wikipedia.org, was launched in May 2001; the first Chinese-language article was published in October 2002, the same month the project moved to zh.wikipedia.org. 34 35 The project faced its first challenge from censors in June 2004 when it was temporarily blocked during the anniversary of the Tiananmen Square protests. 36 The entire project has been blocked on and off since; article-level filtering of sensitive content was reportedly instituted around 2006. 37 The introduction of an HTTPS version in 2011 temporarily gave users in China full access to the project, including articles blocked on the HTTP site. 38
Sophisticated data analysis techniques were not required to identify if and when China has blocked access to Wikipedia [...] News reports around the time of this event corroborate that it was indeed caused by intentional government censorship. 39 We analyzed similar graphs for Wikipedia's 291 other language projects and saw no indications of similar anomalies.
While data analysis is not necessary to detect obvious and documented censorship events, our analysis of article-level censorship also picked up this anomaly. As would be expected from this type of censorship, thousands of articles hosted on zh.wikipedia.org saw strong downward anomalies at the same time:
It is important to reiterate that this graph depicts the number of requests to these articles from all geographic locations, not just those requests originating in China. These anomalies are detectable only because a large portion of the worldwide traffic to these Chinese-language articles originated in China.
Wikipedia's transition to HTTPS-only delivery occurred in June 2015, almost four weeks after China blocked access to all of zh.wikipedia.org. For that reason, we were unable to analyze the effect of the transition to HTTPS-only on the number of requests for Chinese articles.
Using our client network, we were able to confirm that this censorship was ongoing as of late June 2016. We were unable to access zh.wikipedia.org from either of two testing locations in mainland China. While technical limitations in the current deployment of our client network prevent us from [...]
We were also able to confirm that the zh.wikipedia.org domain was the only Wikipedia project affected by this censorship. Our client machines in both locations were able to successfully and reliably access the other 291 Wikipedia subdomains. 41
In order to check for throughput limitations that may or may not have been intentional ("throttling"), we timed how long it took for a complete response to reach our test clients after sending each request. We refer to this time period as the "round-trip time" ("RTT") throughout. We calculated the mean, median, and max round-trip times to each of the projects from both of our test locations. The results for our tests from China are summarized below:
[Table: Mean RTT and related round-trip statistics for requests from China]
[...] Because these articles contain little content, the number of requests recovers overnight, and the article histories show nothing that might explain these changes (such as deleting or renaming the articles), we suspect this behavior might be indicative of either external links to the pages changing or a bot or some other form of programmatic request temporarily suspending activity.
Additionally, our analysis highlighted many anomalous events beginning around August 14, 2013 as well as around August 7, 2015. Articles that were part of these events did not appear thematically related, but traffic drops were significant, and the events were limited to articles in the zh.wikipedia.org domain. While our research did not turn up anything for these dates, we document them here with the hope that they might hold some significance for those more familiar with either Wikipedia's infrastructure or Chinese manipulation of Internet traffic.
While article-level analysis contributed little to the historical context surrounding Chinese Internet censorship, as outlined above, this type of analysis is widely available, and it did serve to bolster our findings from our other methods. Our client tests showed that one Wikipedia domain, zh.wikipedia.org, was completely inaccessible in China, while all other projects were available. Wikimedia's own data on traffic to its projects showed obvious indications of the censorship events reported in the media. While Internet censorship in China is widespread, as of June 2016, Chinese censorship of Wikipedia appears limited to the zh.wikipedia.org domain.

Cuba
The past three years have seen considerable growth in Cuba's Internet infrastructure, but access is still limited and tightly controlled. The country has two ISPs, both of which are state-owned, and Cuba uses the Avila Link monitoring software to track Internet users and obtain usernames and passwords. 42 Most Cubans are only permitted access to the intranet, which includes a small selection of government-approved websites and services; access to the global public Internet is largely limited to a handful of public WiFi access points and expensive government-run Internet cafes. The Revolutionary Orientation Department (DOR) oversees filtering in the country. 43 Political content, including dissident blogs and news sites, is heavily filtered; common social media platforms such as Facebook and Twitter, VoIP services, and web services such as Yahoo and Hotmail are intermittently blocked. 44 We did not have a client testing node available in Cuba.
Almost 100% of the requests coming out of Cuba are for either Spanish or English Wikipedia. 45 Visible in the graphs above is a steep decrease in traffic around the June 12, 2015 HTTPS-only transition. Apart from that, traffic from May 2015 to July 2016 does not show signs that might indicate widespread censorship. Our anomaly detection algorithm did not detect any significant anomalies in the request histories of any other Wikipedia project. While access to the public Internet is restricted, for those with access, we were unable to find any firm evidence that Cuba was censoring any Wikipedia project.

Egypt
Despite offering comparatively free and open access to a wide spectrum of online content, Egypt's Internet environment is still tightly controlled. Political, social, and religious websites are broadly available, but arrests, attacks, self-censorship, and full Internet shutdowns contribute to an atmosphere of repression. Many activists are worried about the draft of a new cybercrime bill introduced in 2015 that would allow the government to heavily increase its censorship role in the name of national security. 46 While this has not yet been enacted, other laws require owners of Internet cafes to track the identities and activities of customers online. VoIP services and encryption tools are also restricted according to Egyptian Telecommunications Laws, but these laws are not widely enforced.
Though it rarely filters online content, the Egyptian government is known for arresting bloggers and journalists critical of the country's current leadership or of Islam. Access to the Internet was most limited during the early 2011 Egyptian revolution protests: for two days both Twitter and Facebook were blocked, and for four days after that, the Internet was down throughout the country. The state's control over the country's telecommunications infrastructure, which is primarily owned by [...]
While holidays are rarely relevant when discussing the availability of websites, it is important to note their effect on web traffic, as they can often look similar both statistically and graphically to other types of outage events. A large part of our manual review process was dedicated to successfully ignoring holiday effects.
As of June 2016, we had no evidence that Egypt censored any part of Wikipedia.

Indonesia
Internet censorship in Indonesia is managed by the Ministry of Communication and Information (MCI), which has broad powers to block "negative" content, mostly granted through the Information and Electronic Transactions Law (ITE). 48 MCI maintains a system called Trust Positive 49 which acts as a database cataloguing content that should be censored, but the actual implementation of censorship is left up to the ISPs. As of June 2016, Trust Positive contained approximately 770,000 URLs, about 99.5% of which were categorized as pornographic. Due to the decentralized nature of the censorship infrastructure, some ISPs filter additional URLs while others do not enforce all of the government mandated blocks. 50 For this reason, it is hard to attribute each censored website to the government.
Though most of the content blocked by Indonesian law is pornographic, the relevant statutes are ambiguous, so content related to radicalism, violence, hate speech, fraud, gambling, child violence and pornography, internet security, and intellectual property rights also sees censorship. 51 The pornographic category itself is also very broadly defined. In 2010, the OpenNet Initiative documented evidence of substantial blocking of pornography across different ISPs, but this block also included sites related to women's rights and LGBT websites. 52 Occasionally, LGBT content is specifically targeted by censors, despite being legal in the country. 53 Censors in Indonesia have also appeared willing to censor entire platforms for relatively small amounts of content, at various times blocking all of Netflix, Tumblr, Reddit, and Vimeo, mostly for nudity or sexually explicit content. 54
We consider these articles particularly suspicious because most of them are sexual in nature, which, as noted above, is a sensitive topic in Indonesia. None appear to be related to changes to the articles themselves that might otherwise explain significant traffic decreases (such as article deletion or renaming). We do note, though, that none of the articles show substantial and sustained increases in traffic after the HTTPS-only transition of mid-June 2015, which we might expect for articles that were censored.

Client-side tests from Indonesia returned nothing indicating domain or subdomain blocking. Round-trip times were somewhat slow, but still within normal boundaries. One project, mus.wikipedia.org, took more than four seconds to return, but subsequent requests returned in regular time.
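The round-trip-time summaries reported for each test location (mean, median, and max, plus the project responsible for the slowest response) can be computed with a short Python sketch. The function name and the dictionary layout are assumptions for illustration; the per-project timings themselves are not reproduced in this report.

```python
from statistics import mean, median

def summarize_rtt(rtt_ms_by_project):
    """Summarize round-trip times (milliseconds) keyed by project domain."""
    times = list(rtt_ms_by_project.values())
    slowest = max(rtt_ms_by_project, key=rtt_ms_by_project.get)
    return {
        "mean_ms": mean(times),
        "median_ms": median(times),
        "max_ms": rtt_ms_by_project[slowest],
        "slowest_project": slowest,
    }
```

Reporting the median alongside the mean matters here: a single slow response, such as the one to mus.wikipedia.org, pulls the mean upward while leaving the median largely unchanged.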

Mean RTT    Median RTT    Max RTT
599 ms      565.4 ms      4588 ms (to mus.wikipedia.org)

The server-side data on Indonesia did include one significant anomaly that did not appear related to a public holiday. Traffic to Indonesian and English Wikipedias was significantly lower than normal on July 16, 2015. Further research suggests this might have been related to the eruption of two volcanoes, which caused other disturbances throughout the country. 57
57 "Indonesia closes three airports as two volcanoes erupt," Deutsche Welle, Jul 16, 2015, http://www.dw.com/en/indonesia-closes-three-airports-as-two-volcanoes-erupt/a-18589931.

While it does appear possible that network operators in Indonesia instituted some level of article censorship in the past, our server-side and client-side data analysis did not locate any evidence that censorship of any of Wikipedia's projects was taking place as of June 2016.

Iran
Internet filtering in Iran is implemented by the Commission to Determine the Instances of Criminal Content (CDICC) and broadly overseen by the Supreme Council of Cyberspace; both groups are primarily composed of members appointed by Supreme Leader Ayatollah Ali Khamenei. 58 Content related to the political opposition, human rights (particularly women's rights), minorities, religion, and sex is heavily filtered, as are independent and international media, many major social media platforms, and circumvention tools. 59 60 President Hassan Rouhani, elected in 2013, promised during his campaign to "ensure that the people of Iran will comfortably be able to access all information globally" and stated that "all human beings have a right" to use social networks. 61 Despite those statements, Facebook, Twitter, and a number of other platforms remain blocked, though Rouhani's administration did resist a CDICC order to block WhatsApp in 2014. 62
In 2006, then-president Mahmoud Ahmadinejad announced plans to build a national Internet system, in part to improve the country's digital infrastructure and increase speeds, which are currently among the lowest in the world. 63 The project is considerably behind schedule, but is moving forward. One of the project's stated goals is to move the entire country onto a national network, largely disconnected from the greater World Wide Web, to help ensure that Iranian Internet users are accessing "clean" content on domestic Internet hosts. 64 Iran's current filtering technology is already quite centralized: traffic in and out of the country is routed through the previously state-owned Telecommunications Infrastructure Company, providing the government with the means to monitor online activities, limit access, throttle speeds, and redirect users attempting to access blocked sites. Authorities also employ keyword filtering, SSL man-in-the-middle attacks, and potentially deep packet inspection to manipulate traffic. 65
Iran has intermittently blocked access to the HTTPS version of Wikipedia since it was introduced in 2011; the English and Kurdish versions of the site have also seen temporary blocks. 66 In 2013, researchers used proxy servers in Iran to scan every Persian-language Wikipedia URL (approximately 1.7 million in total) and identified nearly 1,000 blocked articles. Just over 400 of these contained political content; the others involved sex, religion, human rights, arts and culture, media and journalists, academia, profanity, drugs, and alcohol. Over half of the blocked articles were biographies of individuals; approximately half of those were biographies of people the government had arrested, detained, or killed. The study concludes that Wikipedia filtering in Iran is in part keyword-based, triggered when users request URLs that match a blacklist of terms; approximately 200 of the articles were filtered on this basis, while the rest were individually blocked. 67 Given this, the transition to HTTPS-only delivery of content in 2015 should have substantially affected the Iranian government's ability to censor Wikipedia articles.
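Why HTTPS-only delivery undercuts keyword-based URL filtering can be shown with a small Python sketch of a hypothetical on-path filter. It is a simplified model and not a description of any deployed system: it assumes the filter matches keywords against the decoded URL path, that the path is readable only over plain HTTP, and that the host name stays observable under HTTPS (for example through DNS lookups or the TLS SNI field). All names and keywords are placeholders.

```python
from urllib.parse import urlsplit, unquote

def visible_to_censor(url):
    """Return the parts of a request URL an on-path observer can read.

    Over plain HTTP the host and path are both visible. Over HTTPS the path
    is encrypted; this sketch assumes the host name remains observable.
    """
    parts = urlsplit(url)
    path = unquote(parts.path) if parts.scheme == "http" else None
    return parts.hostname, path

def is_blocked(url, keyword_blacklist, blocked_hosts=()):
    """Apply host-level blocking, then keyword matching on any visible path."""
    host, path = visible_to_censor(url)
    if host in blocked_hosts:
        return True
    if path is None:
        return False  # nothing to match against once the path is encrypted
    return any(keyword in path for keyword in keyword_blacklist)
```

Under these assumptions, the same article URL is blocked over `http://` when its path contains a blacklisted term and passes over `https://`, unless the whole host is added to `blocked_hosts`. That is the all-or-nothing choice HTTPS imposes on censors.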
Our article-level analysis indicates that this was indeed the case. We note again that article request histories are not broken out by country; however, Wikimedia's data shows that a large share of the traffic to Persian Wikipedia (fa.wikipedia.org) originates in Iran. 68 Borrowing methodology from another Wikimedia research project, 69 we searched our database of anomalies for articles that saw significantly higher levels of traffic starting around June 12, 2015 (the HTTPS-only transition). We then manually reviewed the resulting articles. This step revealed that many of the articles our algorithm detected saw increased traffic because they were moved or renamed at around the same time as the transition. After removing those articles from our results, we were left with 22 articles that saw increased traffic after the transition that could not be explained by other means.
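The search-and-filter step described above can be sketched in Python. The record layout, the function name, the seven-day window around June 12, 2015, and the score cutoff are all assumptions for illustration; the actual database schema and thresholds are not given in this section, and the exclusion set stands in for the outcome of manual review.

```python
from datetime import date, timedelta

HTTPS_TRANSITION = date(2015, 6, 12)

def candidate_unblocked_articles(anomalies, moved_titles, window_days=7, min_score=3.0):
    """Select articles with a strong positive anomaly starting near the
    HTTPS-only transition, skipping articles known to have been moved or renamed.

    anomalies: iterable of dicts with keys 'title', 'start' (date), 'score'.
    moved_titles: set of titles excluded after manual review.
    """
    window = timedelta(days=window_days)
    selected = set()
    for record in anomalies:
        near_transition = abs(record["start"] - HTTPS_TRANSITION) <= window
        if (near_transition and record["score"] >= min_score
                and record["title"] not in moved_titles):
            selected.add(record["title"])
    return sorted(selected)
```

The automated query only narrows the field; separating genuine unblocking from moves and renames still depends on checking each article's history by hand.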
The set of articles Iran was censoring at the time of the transition was certainly larger than this (as evidenced by the study referenced above), but we do not claim comprehensiveness. We did find that many of the articles identified by our process belonged to the same categories that were most likely to see censorship in the previous research. The set of articles we identified consisted mostly of articles related to sex (fifteen articles, e.g., the Persian equivalents of "Sex" and "Cunnilingus"), but also contained political reformers (e.g., the Persian translation of "Mohammad Khatami") and governmental institutions (e.g., the Persian translation of "Army of the Guardians of the Islamic Revolution"). The full list of articles and their English equivalents is included in Appendix C.
Below is a graph of daily traffic to all 22 articles from December 2011 onward: The uptick in June 2015 is visible, as are two events beginning at the end of December 2011 and the end of March 2012 that affect most articles in the set. It is possible Iranian network operators were testing or otherwise adjusting their censorship capabilities around this time, but we were not able to find documented evidence of this.
Much of our methodology was designed around locating the beginning of censorship events rather than the end. While this did not produce many positive results, we do believe this method identified the start of a censorship event on Persian Wikipedia. The top four anomalies starting on February 26, 2015 for articles in the fa.wikipedia.org domain are:

Article Translation
Three are clearly identifiable as alcohols, the consumption of which has been illegal in Iran since 1979. 70 The article that translates to "Sweat" is a disambiguation article whose first link is to the article "عرق سگی," which translates to "Aragh Sagi." "Aragh Sagi" is a type of alcohol, and "عرق" is a translation of both "sweat" and "distillate." We believe specific censorship of a disambiguation page to be unlikely, and instead suggest that this fact supports the assertion that filtering in Iran is at least partially keyword based. The graph of daily requests for these four articles around this time period is below: The drop in requests is clearly visible, and while we have not calculated the statistical significance, it appears as if traffic to each article increases slightly beginning right after the HTTPS-only transition. It is also interesting to note that if this is a censorship event, Iranian censorship officials were actively adding to their lists of blocked content as recently as February 2015, which means these articles were likely censored for only a matter of months.
We also located an event during the spring of 2013 during which more than 20 seemingly unrelated articles saw large falls in traffic (e.g., the following graph depicts traffic to the Persian equivalents of "Psychology," "Immanuel Kant," "Don Quixote," and "Cosmetics"): We consider this event unlikely to be censorship because although it happens slightly later, it is similar to many other events across many languages during the spring of 2013 in which numerous unrelated articles saw dramatically decreased traffic before returning to normal levels weeks later. This widespread event is documented in Appendix D. This could have a number of causes, though we believe this decrease is less likely to be related to censorship, as similar decreases in traffic around the time of the transition can be seen in traffic from countries not known to have blocked any part of Wikipedia in the past (e.g., Fiji, outlined in Additional Findings below).
We did not have client testing infrastructure in place in Iran.
Our research on Iran uncovered evidence backing the claims of previous researchers that Iran has blocked Wikipedia articles in the past and that many of those were related to sex or Iranian politics. We further suggest that Wikipedia's transition to HTTPS disabled at least some part of this censorship. While server-side and article analysis indicated that portions of Wikipedia had been censored by Iran in the past, as of late June 2016, evidence of this censorship no longer existed, and at least some articles that had likely seen censorship were receiving increased levels of traffic since Wikipedia's transition to HTTPS.

Kazakhstan
The most heavily censored content in Kazakhstan is that related to religious extremism. Most blocking happens by court order, and throughout all of 2014, the Prosecutor General's Office asked courts to block 703 websites and 198 specific URLs related to the topic. The most significant recent cases of such censorship were related to domestic and international coverage of Kazakhstan's association with ISIS. For example, in the fall of 2014, any web pages containing a series of ISIS videos portraying alleged Kazakh nationals as ISIS soldiers were blocked. 71 Though the bulk of censorship is dedicated to extremism, popular social media sites have also been targets in the past, though official reasons are rarely given, and ISPs often deny blocking the sites. and VKontakte were blocked intermittently for short periods of time in 2014. 73 There have also been several cases of content removal from YouTube, such as a video of ethnic related struggle in South Kazakhstan. Some websites are blocked without any evident court decision, including two major Central Asian news sites, Ca-news (based in Kyrgyzstan) and Fergananews (based in Russia), which are inaccessible for unknown reasons. 74 Article-level analysis of Kazakh Wikipedia discovered a significant number of anomalies, though further investigation suggested all were associated with the public holidays of either Gregorian New Year or Nowruz (beginning around March 20). Analysis of server-side data revealed much the same thing: Client-side tests from Kazakhstan were highly inconsistent, with all projects seeing a large number of intermittent errors. These intermittent errors occurred on all tested domains, pointing to an error in the testing node rather than any external issues. Despite this fact, after repeated requests, we were able to successfully receive responses from all Wikipedia projects. 
The timing of the network requests did not indicate anything out of the ordinary:

Pakistan
intermittent" 76 but generally targets topic areas that threaten national security or are religiously blasphemous. Access to international news organizations and independent media is generally open, as is access to the websites of human rights organizations, local civil society groups, and Pakistani political parties. Since 2011, all online pornography has been banned, a block that has also affected some sex education and health websites. 77 YouTube has been largely blocked since 2012, when an anti-Islamic video garnered attention throughout the Muslim world. 78 In January 2016, a localized version of YouTube was created that allows the Pakistani government to monitor and take down content deemed inappropriate. 79 In 2013, Citizen Lab researchers documented the use of Netsweeper filters to block political, social, and religious content on the network of Pakistan Telecommunication Company Limited, the largest telecommunications company in the country. 80 Facebook and Twitter have received public criticism in the West for limiting access to content at the request of the Pakistani government; 81 in 2014, both platforms republished previously blocked content. Wikipedia is generally accessible, but was blocked for a few hours in 2006 and for several days in 2010. 82 83 Attributing historical article-level censorship to Pakistan is difficult. As 98% of Wikipedia requests are directed at English Wikipedia, 84 and our current data does not allow us to separate Pakistani requests to English Wikipedia from requests from other countries, we have little-to-no ability to detect Pakistani article censorship.

Our client test node in Pakistan was able to access all Wikipedia projects in a timely fashion:

Russia
Over the past few years, the Russian government has systematically moved to increase its control over the online information environment, passing new legislation that expands authorities' power to access user data, monitor online activity, and block and take down websites. 85 OpenNet Initiative testing in 2010 found evidence of filtering only of sexually explicit content, but no evidence of political filtering. 86 In the past six years, filtering has grown dramatically and now includes opposition websites, content related to the 2014 conflict in Ukraine and other political protests and events, "extremist" content, and information about drugs and suicide. 87 The federal agency Roskomnadzor, tasked with supervising electronic media in the country, maintains a blacklist of blocked sites; several Wikipedia articles in both Russian and English, most related to drugs or suicide, have reportedly appeared on the list since 2012. 88 In July 2012, editors of Russian-language Wikipedia shut down the site for 24 hours to protest pending legislation that would increase the government's powers to block online content. 89 This event was represented in our article-level analysis as the most anomalous event we saw for Russian Wikipedia. On July 10, 2012, there were significant decreases in traffic across more than 1,000 articles that quickly disappeared the next day:

The fact that about two-thirds of all traffic to Russian Wikipedia originates in Russia 90 supports the conclusion that this event was indeed related to the protest.
In August 2015, access to ru.wikipedia.org was temporarily blocked after Russian Wikipedia did not meet Roskomnadzor's demands to remove an article about a type of cannabis. The site's use of HTTPS meant the internet service providers were unable to block the individual offending page and therefore would have to block all of Russian Wikipedia. 91 The block lasted for several hours before Roskomnadzor announced that the article had been sufficiently edited to meet its guidelines, though Wikipedia editors said the page remained the same. 92 The decrease in traffic that this ban likely caused was not detected by our algorithm on either the article level or the level of Russian Wikipedia as a whole.   93 We were unable to find any other evidence to support this hypothesis.
Russian Wikipedia also contained a relatively large number of anomalies that were limited in scope to single articles. Investigation of many of these cases revealed that in most circumstances, the articles in question were deleted or moved (e.g., "Нагота" ["Nudity"] on July 9, 2014, "Косово" ["Kosovo"] on January 8, 2013, and "Массовое_убийство" ["Mass Murder"] on February 20, 2014).
These were picked up by the anomaly detection algorithm, as they often had a significant amount of traffic prior to deletion. After manually removing from analysis those articles that had plausible explanations for traffic drops, we were still left with a number of articles with unexplained significant traffic drops:

Saudi Arabia
In 2014, Reporters without Borders ranked the Kingdom of Saudi Arabia 164th out of 180 countries in terms of press freedom, emphasizing that the Kingdom is "relentless in its censorship of the Saudi media and the Internet." 94 All international Internet traffic is routed through two national providers, Integrated Telecom Company and Bayanat al-Oula for Network Services, giving the government the ability to review and filter requests. 95 The Communications and Information Technology Commission oversees Internet filtering in the country, and the list of content blocked in the country is long. First, Saudi Arabia uses commercially available software (SmartFilter 96 ) to locate URLs related to pornography, gambling and drugs, which it then blocks. It also maintains a local list of
URLs separate from this categorization mechanism. 97 This list reportedly contains a broader set of content, including content related to violent extremism, criticism of Gulf royal families, political opposition, censorship circumvention tools, P2P file sharing tools, LGBT issues, human rights organizations, religious scholars (especially those related to the minority Shi'a faith), mirror sites, and unlicensed online publications. 98 99 It is unclear how willing Saudi authorities are to block entire sites over single pieces of content. In 2012, the government threatened to block YouTube if a controversial video was not taken down, but the blocking did not occur because YouTube removed the video in question. 100 Internet restrictions in Saudi Arabia are not limited to content filtering; a 2009 law led to the installation of hidden cameras in all web cafes to track users, and self-censorship among online writers is widespread. 101 The government regularly arrests those who use social media to document human rights abuses, express political opinions critical of the ruling family, or criticize the official religion; those who are convicted are sentenced to jail time and, in at least one case, corporal punishment. 102 In 2006, Saudi Internet users started reporting the censorship of a number of Wikipedia pages in both English and Arabic, mostly related to sexual content. 103 When the blocking occurred, some Saudi citizens felt that some of the pages were unfairly blocked and contained "beneficial" content. 104 Arabic and English Wikipedia together account for more than 95% of the requests from Saudi Arabia, 105 and our analysis did not show the type of traffic anomaly that would be indicative of domain blocking over the period from May 2015 to July 2016.

We were able to access all Wikipedia subdomains from our client test point in Saudi Arabia, and all round-trip times were within normal ranges:

Mean RTT Max RTT
283.5 ms 308.6 ms 1142 ms to am.wikipedia.org

Our article-level analysis is not segmented by country, and while the largest share of requests to Arabic Wikipedia comes from Saudi Arabia, that share is only approximately one-fifth. 106 If we were to locate likely censorship events in Arabic Wikipedia, it would be impossible without additional data to definitively attribute that censorship to Saudi Arabia. Given the results of our client and server data analysis, as of June 2016 we had no firm evidence that Saudi Arabia was censoring any Wikipedia domain or subdomain.
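The round-trip summary reported for the client test point can be sketched in Python as follows. This is an illustration only, not the study's tooling: the function name, the input format (a mapping from domain to one round-trip time in milliseconds), and the rule of flagging values above three times the median are all assumptions.

```python
from statistics import mean, median


def summarize_rtts(rtts_ms, outlier_factor=3.0):
    """Summarize per-domain round-trip times given in milliseconds.

    rtts_ms maps a domain name to a single measured round-trip time.
    A domain is listed as an outlier when its time exceeds
    outlier_factor times the median (an arbitrary, assumed rule).
    """
    values = list(rtts_ms.values())
    med = median(values)
    return {
        "median_ms": med,
        "mean_ms": mean(values),
        "max_ms": max(values),
        "slowest": max(rtts_ms, key=rtts_ms.get),
        "outliers": sorted(d for d, v in rtts_ms.items() if v > outlier_factor * med),
    }
```

A single slow response, such as the maximum reported above, would appear in the outlier list without shifting the median much, which is why a median is often reported alongside the mean.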

South Korea
South Korea's Internet filtering regime is largely focused on its relations with North Korea and on sexually explicit content. The majority of banned websites are North Korean news organizations or sites run by North Korean "sympathizers," but pornography and LGBT websites are also widely banned. 107 The National Security Act in Cyberspace prohibits, among other things, "sympathizing" with North Korea online; more than 100 people were convicted of this crime between 2012 and 2014. 108 South Korea's constitution states that "neither speech nor the press may violate the honor or rights of other persons nor undermine public morale or social ethics." 109 These restrictions have been used to justify the censoring of attacks against politicians, sites connected to North Korea, and pornography sites. The government's decision to ban online gaming for six hours each day for citizens younger than sixteen also loosely falls under this guideline. 110

The Korean Communications Standards Commission (KCSC) is in charge of regulating the Internet, but in 2014 the Public Prosecutor's office set up an investigative unit charged with monitoring online slander and rumors. 111 South Korea has a history of defamation cases involving the Internet; in 2012 a National Intelligence Service (NIS) agent removed Twitter accounts that were critical of President Park Geun-hye, who was running for reelection at the time. 112 Just two years later, Han Sun-Kyo, a conservative, attempted to pass a law that would prevent "rumor mongering" in the wake of the capsizing of the Sewol ferry, which left over 300 people dead. 113 Harsh punishments for defamation exist in South Korea; online defamation is penalized severely, with fines reaching $45,000 USD at times. 114 The article-level analysis we conducted revealed some anomalies that could not be attributed to changes to the articles themselves. There were only two anomalies that occurred at approximately the same time: " " ("Perineum") and " " ("Agnosticism"): While it is interesting that traffic to both articles dropped significantly on the same day, the fact that this anomaly was limited to these two articles and that they are not closely related thematically makes us doubt that this was a censorship event. There were other anomalous events for single articles throughout our analysis, the most significant of which were " " ("Engine Oil") starting on February 19, 2014 and " " ("Song of songs") starting on May 17, 2014.
We were able to access all Wikipedia project subdomains from our test location in South Korea with no problems. Response times for each of the domains were within typical ranges: The history of requests from South Korea to both Korean and English Wikipedias over the period of analysis appears regular with no signs of outages: As of June 2016, we were unable to find any strong evidence that South Korea has censored or was censoring any of Wikipedia's articles or projects.

111 "Freedom on the Net 2015: South Korea," Freedom House, Oct 2015, https://freedomhouse.org/report/freedomnet/2015/south-korea.
112 Ibid.
113 Ibid.
114 Ibid.

Syria
Syrian netizens experience extensive censorship online around politics, minorities, human rights, and foreign affairs. Examples of censored content include the London-based news outlets Al-Quds al-Arabi and Asharq al-Awsat, many Lebanese online newspapers, websites campaigning to end Syrian influence in Lebanon, WhatsApp, the Muslim Brotherhood, websites that advocate for the Kurdish minority, and the entire Israeli top-level domain ".il." Websites related to human rights awareness such as the Violations Documentation Center are also blocked. 115 According to the Wall Street Journal in 2012, out of 2,500 attempts to visit Facebook, two-fifths were permitted and three-fifths were blocked. 116 Censorship also extends to mobile communication: Bloomberg reported in 2012 that a special government unit known as Branch 225 had ordered Syrian mobile providers to block text messages containing words like "revolution" or "demonstration." 117 The fact that both YouTube and some pages on Facebook remain accessible make activists suspect that the current regime is trying to track citizens' online activities. Other social media applications like the VoIP service Skype suffer from disruptions either due to low speeds or intermittent blocking by the authorities. Over the past decade authorities have detained hundreds of Internet users, including several well-known bloggers and citizen journalists. 118 Wikipedia in Arabic was reportedly blocked from April 2008 until

Using our methodology and the data available, article-level censorship would be difficult to attribute to Syria, as Arabic Wikipedia is accessed heavily from many countries. We did not have a client test node in Syria.
Our analysis of server-side data detected no significant anomalies in traffic from Syria to any Wikipedia project. Nevertheless, we conducted a manual review of Arabic and English Wikipedia because they are the most popular Wikipedia projects in Syria, together accounting for approximately 98% of traffic. 120 The number of requests to these Wikipedia projects show no significant anomalies between May 2015 and July 2016: While censorship of the Internet is known to be widespread in Syria, and censorship of Wikipedia specifically has occurred in the past, the lack of both data and access made establishing the June 2016 state of Wikipedia in Syria particularly difficult.

Thailand
Censored content in Thailand is similar to that of other countries: pornography, gambling, and censorship circumvention tools are all extensively blocked, 121 but the censorship extends to content that is specifically sensitive in the context of Thailand. As there have been a number of coups in Thailand in recent years, political opposition and activism content is strongly suppressed, as are some foreign news outlets, some domestic news outlets, human rights content, select academic websites, and Facebook and YouTube pages that relate to coups. 122 Lèse-majesté, the insult or defamation of royalty, is a serious crime in Thailand, and has led to a number of censorship incidents. The law against lèse-majesté has been used to prosecute those who have posted social media updates, news articles, audio and video content, and poetry deemed offensive, as well as at least one Internet user who sent an email containing links to lèse-majesté content. In 2015, prison sentences for violating the prohibition on lèse-majesté reached a record high of 60 years. 123 In 2008, the Wikipedia article for Bhumibol Adulyadej, the King of Thailand, was reportedly blocked. 124 No official reason was given, but it was possibly related to the lèse-majesté law.
The military junta that took power during the 2014 coup intensified controls over the Internet, instituting new filtering and surveillance and arresting activists and others. 125 During the coup, the government ordered ISPs to block Facebook in an effort to prevent activists from protesting. 126 Those who criticized the coup online were detained and forced to promise silence and turn over their social media passwords in exchange for their release; the government collected 400 passwords this way in 2014. 127 In August 2015, the government announced plans to implement a "Great Firewall" that would have directed all Internet traffic through a single point, but it abandoned the plan in October 2015. 128 Article-level analysis of Thai Wikipedia did not reveal anything we would consider likely censorship events, though there was an anomalous event that we were not able to explain. Starting around August 17, 2015, a number of thematically unrelated articles saw significant decreases in traffic that each lasted for about a month before returning to previous levels. This anomaly took place during the period of time for which we had request data broken out by both article and geography, and we were able to confirm that this anomaly was present in requests originating in Thailand. A graph depicting the event is below: Round-trip times to the various Wikipedias were the slowest of all our testing locations, but they were still within acceptable ranges. While it is unlikely we identified article-level censorship in Thailand, we did confirm that as of June 2016, Thailand was at least intermittently interfering with the regular functioning of Wikipedia.

Turkey
Internet penetration and usage in Turkey has been rapidly increasing over the last decade with 2014 marking the first year more than half the Turkish population could be considered Internet users. 130 The dramatic increase in Internet usage has seen a concomitant increase in the Turkish government's efforts at controlling access to information on the Internet. Before 2007, Internet censorship in Turkey was sporadic and limited, 131 but the passage of Internet Law No. 5651 in May 2007 was the first big step toward systematizing Turkey's blocking regime and grounding it in a legal framework. 132 Among other things, Law 5651 outlined eight categories of content that were to be subject to blocking, required all Internet hosting and access providers in Turkey to obtain a license from the government, and granted an organization called the Presidency of Telecommunication and Communication (TIB) the authority to block any website it deemed in violation of the law. 133 Since the large, anti-government protests in Gezi Park in June of 2013 in which social media played a large role, Law 5651 has been amended numerous times to relax requirements for judicial review, broaden While there are now nine legal categories of criminal content (e.g. content relating to child pornography, obscenity, or gambling), censorship is not limited to these categories. Sites relating to pornography, intellectual property infringement, ethnic minorities, LGBT issues, political movements and news outlets have all been censored. 136 Content unrelated to any of these categories often sees censorship due to the common practice of blocking entire sites for single pieces of infringing content. YouTube, Twitter, Blogger and Wordpress have all been the subject of such blocks, 137 and according to a Turkish watchdog organization, as of June 2016, more than 110,000 unique domains are entirely blocked in Turkey. 138 While full domain blocking is widespread, not all censorship occurs at the domain level. 
Turkey also has the capability to filter individual pages, as has been the case with Turkish Wikipedia. Though it is unclear exactly who ordered the censorship and its full extent, 139 there have been media reports of at least five censored articles: "İnsan penisi" ("Human penis"), "Kadın üreme organları" ("Vulva"), "Testis torbası" ("Scrotum"), "Vajina" ("Vagina"), and "Haziran 2015 Türkiye genel seçimleri için yapılan anketler" ("Opinion polling for the Turkish general election, June 2015"). 140 For a number of weeks in the summer of 2015, Turkish Wikipedia included a banner on the main page warning users of this censorship. 141 Our article-level analysis included data for three of these censored articles:

At the far right of the above graph, a sharp uptick in the number of requests for each of these three articles is plainly visible. This uptick occurs on June 12, 2015, the same day as the transition to HTTPS-only delivery. And though we did not limit this analysis to only those requests originating in Turkey, approximately nine out of every ten requests for Turkish Wikipedia come from the country. 142 For these reasons, we believe this is likely an instance of the HTTPS transition enabling more access to these articles. If we look past June 12 though, the picture becomes more complex: The request spike of June 12 is but a small blip before a much larger increase in traffic on June 19. As can be seen above, this traffic volume slowly decreases over a number of weeks before falling back down around August 14, 2015. It is difficult to attribute this much larger increase and subsequent decrease to any single cause, though we believe automated activity is the most likely culprit. The fact that during this period of increased activity, each article followed a very similar pattern is one piece of evidence pointing to this conclusion. It is interesting to note that the average number of daily requests after this event is higher than the pre-HTTPS average, which is consistent with the hypothesis that HTTPS enabled more access, regardless of the cause of the intervening anomaly.
Client-side analysis of all Wikipedia projects from a network location in Turkey revealed nothing indicative of domain-level censorship, and round-trip times were within normal ranges: Neither of these projects showed significant decreases in traffic apart from a short period of time around Ramadan (mid-July 2015):

While all Wikipedia language projects appeared available in our June 2016 client-side tests from Turkey, our analysis supports the media reports of a number of articles having been blocked in the past for at least a portion of Turkish citizens.

Uzbekistan
Though it may receive less media attention than other countries with high levels of censorship, Uzbekistan has one of the most intensely controlled online and media environments in the world. Internet censorship has been present in Uzbekistan since about 2002, and has been steadily increasing. Uzbek law prohibits Internet operators from disseminating information that calls for violent overthrow of the government, instigates other forms of violence, is pornographic, relates to religious extremism, or "degrades and defames human dignity." 144 The newly formed government organization that oversees this censorship, the Ministry for the Development of Information Technologies and Communications, is also charged with preventing the "negative influence on the public consciousness of citizens, in particular of young people." 145 The actual implementation of these laws has created a censorship regime as broad as the legal language suggests. The fairly well-defined categories of pornography and terrorism are blocked, but a host of other topics are censored as well, including: reports of government corruption, human rights organizations (including Amnesty International, Freedom House, and Human Rights Watch), Typically, governments target specific URLs for censorship, but occasionally entire domains are blocked. 148 One such occasion was the blocking of all of Uzbek Wikipedia in early 2012. 149 This was an interesting case because Uzbek Wikipedia is less popular in Uzbekistan than either Russian or English Wikipedia, in terms of both articles and the number of requests, and yet neither of those Wikipedias were blocked. 150 An official reason was never given for the ban, but some have speculated that it was due to the addition of a number of articles related to sex that took place shortly before the block. 151 It is possible that only Uzbek Wikipedia was blocked because Uzbek is the only official state language of Uzbekistan. 
152 Surprisingly, this block is not visible in the number of requests to Uzbek Wikipedia's main page and was therefore not picked up by anomaly detection: A suspicious drop in requests occurs for articles that were hypothesized to be related to the block, but the date of the reported block and the dates of the anomalies do not agree: Starting on June 11, 2016, Russian, English, and Uzbek Wikipedias saw significant drops in the number of requests received. From June 11 through June 17, the average number of daily requests was 19% lower than the previous week for English Wikipedia, 15% lower for Russian Wikipedia, and 62% lower for Uzbek Wikipedia. Unlike previous analyses, this decrease cannot be accounted for by any public holiday that we could identify. These trends can be seen in the following graphs:

Of particular note is the spike that occurred on the far right end of the graph for Uzbek Wikipedia.
On June 20, 2016, traffic to Uzbek Wikipedia appeared to return to its previous levels for one day before falling back down to the depressed levels. This spike is not present in requests to Russian or English Wikipedia. While daily spikes are common in web request data, they typically rise above base levels before returning to normal, rather than rising quickly to normal levels from depressed levels. This leads us to believe this is unlikely to be a natural traffic event, but rather some external process interfering with requests to Uzbek Wikipedia. As noted above, this would not be unprecedented, though potential motivations for this recent censorship event are unknown. These long round-trip times were not limited to Wikipedia projects and occur across all tests to all URLs, so we believe they were due to our deployment rather than anything related to throttling. The very long wait for nrm.wikipedia.org returned to near average on subsequent tests.
Given the unusual pattern of server-side data and our repeated inability to access Uzbek Wikipedia, we believe it is likely there was some kind of blocking of Uzbek Wikipedia occurring in Uzbekistan as of June 2016.
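The week-over-week comparison reported in this section (for example, the 62% decrease for Uzbek Wikipedia) amounts to comparing average daily requests across two consecutive weeks. A minimal Python sketch of that arithmetic follows; the function name is hypothetical and the request counts in the usage example are invented for illustration, not taken from Wikimedia data.

```python
from statistics import mean


def percent_drop(previous_week, current_week):
    """Percentage by which the mean of current_week falls below previous_week.

    Both arguments are sequences of daily request counts.
    A negative result means traffic increased.
    """
    prev_avg = mean(previous_week)
    if prev_avg == 0:
        raise ValueError("previous week has no traffic to compare against")
    return 100.0 * (prev_avg - mean(current_week)) / prev_avg


# Illustrative (invented) counts: 1,000 requests/day falling to 380 requests/day.
example_drop = percent_drop([1000] * 7, [380] * 7)
```

Comparing full weeks rather than single days avoids mistaking ordinary weekday and weekend variation for a drop.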

Vietnam
Online activity in Vietnam is tightly restricted through content filtering, fines, website licensing, targeted cyber attacks, and arrests and detentions. The vast majority of content censored in Vietnam is content that could conceivably challenge the power of the ruling political class. In September 2010, OpenNet Initiative researchers found that both of the government-owned ISPs, Viettel and FPT Telecom, were blocking opposition and political reform websites, Vietnamese-language news sites, sites related to the Degar ethnic minority, Facebook, and sites related to circumvention tools. 155 The Decree on Management, Provision, and Use of Internet Services and Information Content Online, adopted in 2013, prohibits the use of the Internet to "oppose the Socialist Republic of Vietnam; threaten the national security, social order, and safety; sabotage the 'national fraternity'; arouse animosity among races and religions; or contradict national traditions." 156 Circular 9, issued in 2014, requires a government license for companies founding new social media sites. The government also employs surveillance, requiring owners of cybercafes to track users' Internet activity. 157 In 2014 and 2015, the government imprisoned 29 bloggers, writers, and activists. Over the past few years, the government has instituted new legislation that strengthens its controls over online content and activity. In 2013, the government issued Decree 72 which forced social media companies to censor content and followed that with Decree 174 in 2014 which authorized punishments for online speech. The Vietnamese government also restricts freedom on the Internet with Article 258 which is a law that bans the "abuse of democratic rights to infringe upon the interests of the State, the legitimate rights and interests of organizations and citizens." 158 In 2014, the government used this law to prosecute over a dozen rights advocates and bloggers and to block two prominent blogs critical of the government. 
During Obama's official visit in May 2016, the government blocked Facebook to prevent "political dissidents" from voicing their grievances on social media, following a pattern established after it blocked Facebook during environmental protests earlier that month. 159 We believe that anomalies detected by article-level analysis of Vietnamese Wikipedia are likely associated with requests from Vietnam because almost 90% of traffic to Vietnamese Wikipedia comes from Vietnam. 160 We detected a number of significant anomalies in articles in Vietnamese Wikipedia. For instance, about 130 articles saw significant but temporary drops in traffic around the end of July 2013. These articles do not appear to be thematically related. The following graph depicts "Veneration of the dead," "Mongolia," "Donnie Yen," "Guanyin," "Czechoslovakia," and "Chắn" (a card game):
We were unable to find a suitable explanation for this drop in traffic, either internal or external to Wikipedia.
155 "Vietnam," OpenNet Initiative, Aug 7, 2012, https://opennet.net/research/profiles/vietnam.
156 Sayuri Umedia, "Vietnam: Controversial Internet Decree in Effect," Global Legal Monitor, Sep 6, 2013, http://www.loc.gov/law/foreign-news/article/vietnam-controversial-internet-decree-in-effect/.
157 "Vietnam: Freedom on the Net 2015," Freedom House, Oct 2015, https://freedomhouse.org/report/freedomnet/2015/vietnam.
158 Ibid.
We did detect an anomaly for a number of articles that appear thematically related beginning around June 13, 2014. All of the articles to which traffic decreased substantially around this time are related to sex with the exception of "Đá gà," which translates to "Cockfight." We were unable to associate this decrease with any change internal to Wikipedia (e.g., the move or deletion of the article).

Article (Translation)
Làm tình bằng miệng ("Oral sex")
Đá gà ("Cockfight")
Note that unlike in the Iranian case, traffic to these articles does not appear to increase substantially after June 12, 2015:

It is unclear why we do not see a concomitant increase in traffic as in the Iranian case, but there are at least two plausible explanations that are consistent with the observed data. First, the decrease in traffic could be unrelated to censorship and instead be due to some other change (e.g. cessation of bot activities) that would not be affected by HTTPS. This would be consistent with the fact that while pornography is illegal in Vietnam, 161 there are no known instances of censorship of Internet pornography. Second, it could be the case that this was a censorship event but user behavior toward the censored pages changed in such a way that traffic did not increase once the pages were available again (e.g. linking patterns changed in the intervening period to route away from the censored articles).
Perhaps relatedly, this larger event was predated by a significant decrease in the amount of traffic to the article "Tình dục hậu môn" ("Anal sex") on September 18, 2013: Again, this change does not appear to be associated with any change internal to Wikipedia. Unlike the previous event, it is not correlated in time with any other significant anomalies of the same nature.
Other articles that do not appear explicitly sexual in nature also experienced significant decreases in traffic that cannot be accounted for by actions on Wikipedia alone. These include the articles for "Girls' Generation," a South Korean music group, and "Huyện (Việt Nam)," an administrative district in Vietnam:

Project-level traffic from Vietnam to Vietnamese Wikipedia appeared normal from May 2015 to July 2016 with the exception of the beginning of February, which is the multi-day holiday of Vietnamese New Year (Tết).
Our client-side testing from Vietnam showed no problems accessing any Wikipedia project subdomains, and page load times were within normal ranges:

Mean RTT Max RTT
367 ms 382.2 ms 980 ms to bxr.wikipedia.org
While the evidence is not conclusive on its own, we believe we have surfaced tentative signs that at least some portion of Vietnam's Internet users may have been blocked from accessing sexually explicit articles in the past. The HTTPS transition now makes this type of censorship unlikely, and our client and server data analysis of June 2016 showed no evidence that Vietnam was blocking the entirety of any Wikipedia project.

Additional Findings
Each of our analysis methods had additional results that did not pertain directly to the countries enumerated above. We have provided these results here, organized by our method of analysis.

Article-level Analysis
Overall, our article analysis pipeline detected 92.4 million anomalies across the 1.7 million articles. We started by looking at the anomalies that indicated significant drops in traffic, and the first things we noticed were nine distinct periods of time in which traffic dropped precipitously for a large number of articles across most, if not all, Wikipedia projects. These events were fairly short, most lasting only a day. We believe these were likely periods of faulty data collection. The spring of 2013 also saw many articles across many projects lose significant traffic; these events were harder to explain as they did not all begin at the same time and often lasted for weeks. Together, the nine data collection errors and the spring of 2013 accounted for a large number of the most significant downward anomalies we witnessed. Due to their widespread nature, we considered them unlikely to be related to censorship, and therefore excluded them from the rest of our analysis. These periods are outlined in more detail in Appendix D.
After removing these dates, we were left with 84.1 million anomalies. 71 million were anomalous increases in traffic (which our pipeline also detected but we only briefly reviewed) and the remaining 13.1 million anomalies were significant decreases in traffic.
One more date of note was the date Wikipedia changed over to HTTPS for all traffic. While this might not have taken place simultaneously across the globe, it looks as though June 12, 2015 was the most common date. Many, though not all, articles saw significant drops in traffic around this date. Eighteen of the articles that saw the sharpest decreases were articles for various currencies on English Wikipedia:
While we were unsure of the cause, a plausible explanation might be that the infrastructural change that the transition required affected the collection of the request metrics to some degree. The data that we used for our analysis did not have requests by bots or spiders filtered out, so it is also possible that these automated processes were using HTTP to request articles and could not handle the redirect to HTTPS. It could also be the case that some network operators were performing full HTTPS protocol blocking. Whatever the cause, per-project request metrics also saw a number of drops around the HTTPS transition. The graph of Fiji below in the Server-side Data section is a good example.
Because our article-level analysis was broken out by language rather than country, we have a number of results that are limited to a single language, but could potentially relate to one or more countries. For example, nine countries each contribute more than 1% of the requests to English Wikipedia, including countries known to have blocked articles in the past (e.g., Iran). 162 This makes it difficult to attribute anomalies to any individual country for languages like English that are spoken in many countries. Nevertheless, we felt it important to include these anomalies for both completeness and to aid any research that may follow this report. Though we located and include a number of these anomalies, we did not spend as much effort investigating anomalies for these languages.
One event that stood out on English Wikipedia was a cluster of anomalies beginning around March 2, 2015. Twelve articles, all related to clothing, saw the most significant drops in traffic around that date. Those twelve articles are: "Blanket sleeper," "Coin purse," "Débutante dress," "Denim skirt," "Goggle jacket," "Gymslip," "Jodhpurs," "Nightshirt," "Nightwear," "Opera coat," "Swim diaper," and "Undershirt." A graph of their overlapping time series shows that the patterns are remarkably similar: Given the fairly innocuous or archaic natures of many of these articles, we think this is likely a good example of some automated process ceasing operations that previously contributed a large share of requests.

Less innocuous articles saw correlated decreases in traffic around the beginning of June 2013. This event covered at least the following articles: "Femme fatale," "Hermaphrodite," "Homosexuality," "Kanyakumari," "Labia majora," "Mario," "Pantyhose," "Phallus," "Slavoj Žižek," "Testicle," "Undergarment." The following graph depicts this trend: Nothing was discovered in the histories of these articles that could explain their sudden drop in traffic.
Arabic Wikipedia saw a number of significant anomalies, but after investigation, a large share of these anomalies were likely caused by Wikipedia users changing article titles or introducing redirects. We did identify one cluster of thematically related anomalies. Beginning near December 8, 2013, traffic to articles for four Middle Eastern cities saw a quick increase and then a sharp and sustained drop-off. The following graph illustrates the phenomenon:
These cities translate to, in the order shown in the graph's legend, "Irbid," "Aqaba," "Ramallah," and "Nablus." Due to the relatively innocuous nature of the content and the synchronized increase in traffic across the four articles that we could not attribute to outside factors, we suspect this might be related to bot activity rather than censorship.
We did not cover the country of Serbia, but Serbian Wikipedia saw a temporary drop in requests to a number of articles that are mostly related to human health:
The English equivalents of these articles, in the order they are given in the graph's legend, are "Clinical depression," "Breast cancer," "Ovulation," "Multiple sclerosis," "Human papillomavirus," "Ten Commandments," and "Sperm." Nothing in the histories of these articles suggests a cause for this anomaly.
Sex-related articles on Portuguese Wikipedia saw a number of significant downward anomalies, but three articles related to sex all saw anomalies at around the same time that cannot be explained by changes in Wikipedia alone: "Boneca inflável" ("Sex doll"), "Creampie" ("Creampie (sexual act)"), and "Sexo virtual" ("Virtual sex"). These anomalies took place around March 1, 2013, and are seen below:
"Gíria sexual" ("Sexual slang") saw a similar anomalous event, though months later on October 12, 2013:
We felt that though suspicious, the small number of anomalies meant there was not enough evidence to suggest censorship.

Project-level Analysis
Anomaly detection on the number of daily requests at the project level turned up a number of interesting decreases in traffic from a number of different countries. We did not analyze all of the detected anomalies, but we did investigate a number of the larger anomalies in an effort to find potential causes. While causal links are hard to establish, there is evidence to suggest that inaccessibility caused by war, governmental decree, and natural disaster are all detectable in Wikipedia's data.
One of the largest anomalies our analysis uncovered was a decrease in traffic from Yemen that lasted for more than two weeks. From March 11 until March 27, 2016, requests for Arabic Wikipedia were down approximately 50% from normal levels. The same holds true for English Wikipedia. The anomaly is easily visible in the traffic graph:
We believe this event might have been related to fighting between the Yemeni government and rebel groups that could have caused infrastructure outages, though we did not find media reports of such outages. This anomaly occurs around the same time fighting intensified in the country as the government forces broke the rebels' siege of Taiz, Yemen's third largest city. 163
The Republic of the Congo saw two fairly large decreases in traffic to French Wikipedia, the largest recipient of the country's traffic. 164 On October 20, 2015, a large, multi-day anomaly began around the same time as protests against the country's president. 165 Months later, when a presidential election took place, the country ordered a total media blackout. 166 This outage is obvious when looking at a graph of the data:

A similar tactic seems to have been used by the government of Chad immediately after their presidential elections on April 10, 2016: 167
Natural disasters likely accounted for a large number of traffic anomalies detected by our algorithms. Perhaps the largest natural disaster that was seen in the data was the earthquake centered near Kathmandu, Nepal on April 24, 2015. The earthquake caused extensive damage in the capital, surely knocking out Internet access along with power to a significant percentage of the population. 168 The outage and the recovery can be seen toward the left side of the following graph:


Next Steps and Conclusions
This report is part of a larger project aimed at locating the global boundaries of access to Wikipedia.
One of the weaknesses of the format is that reports are inherently locked in time while the Internet censorship landscape continues to change. To complement this report in ways that do not have the same drawback, we are currently undertaking two efforts: continued client-side availability monitoring and continued server-side data monitoring. Client-side monitoring of Wikipedia's projects will continue as long as resources allow. In the short term, it is due to expand as new vantage points are scheduled to come online in the latter half of 2016 that were not available during the writing of this report. Server-side monitoring of the levels of traffic from various countries to Wikipedia's language projects will continue in the medium-term and significant anomalies will be brought to the attention of the Wikimedia Foundation. Manual analysis and investigation of the detected anomalies for the purpose of informing the Wikimedia Foundation of potential censorship will not continue as the process is currently intensive in terms of both time and resources.
We believe that with support and further development, the process of detecting censorship and other outage events from Wikipedia data could be further automated and significantly improved. We have a number of ideas in this vein, some of which could leverage existing Wikipedia research. 171 While our process does not yet scale to the full size of Wikipedia, we believe that our multimodal methodology, and anomaly detection in particular, has real, demonstrable value to a number of communities. First, we hope that the research in this report has some utility to the Wikimedia Foundation in their efforts to make all knowledge freely available to every person. Second, in the process of doing this research, we have created a dataset of anomalies in request traffic to a select number of individual articles. This dataset is likely useful in answering research questions around Wikipedia itself, but it and others like it could also be used to answer questions around singular and significant events in the demand for specific pieces of knowledge. We will publish our generated dataset and open source our anomaly detection pipeline. 172 Third, with Wikipedia's vast size and millions of daily requests, the Wikimedia Foundation has an incredible vantage point to witness events around the Internet beyond even the scope of its own large projects. We have shown that Wikipedia's data can be used to discover and track Internet shutdowns and broader outages around the world. If developed into a publicly accessible resource, this could be a tremendous data source for those interested in Internet accessibility issues.
While some of the raw data might be difficult to publish in a way that still preserves privacy, publishing the anomalies detected in Wikipedia's data has far fewer privacy concerns. As outlined above, anomalies at both the article and project level could still have extraordinary value to researchers, advocates, political scientists, sociologists, media and communication scholars, developers of circumvention technologies, policymakers, and others. The generation and publication of this data would also be a fine addition to the Wikimedia Foundation's mission to accumulate and share knowledge.
This information could also help alert the world of those interfering with the Wikimedia Foundation's mission. As of June 2016, it appears China, Thailand, and Uzbekistan are all likely interfering with or completely censoring some part of Wikipedia. The evidence we collected suggests that in each case, the censorship is limited to a single project (Chinese, Yiddish, and Uzbek Wikipedias, respectively). While collectively these projects contain more than one million articles, 173 considering the widespread use of filtering technologies and the vast coverage of Wikipedia, there is currently relatively little censorship of Wikipedia globally. In fact, our research suggests that on balance, there is less censorship happening now than before the transition to HTTPS-only content delivery in June 2015. This initial data suggests the decision to shift to HTTPS has been a good one in terms of ensuring accessibility to knowledge.
And though the current level of censorship may be relatively low, in an ideal world, the Wikimedia Foundation would not need to tolerate any censorship. When working toward that ideal world, there are many priorities and values to balance. We hope that our research has provided some useful context and a number of possible options to consider as the Wikimedia Foundation advances its mission in the future.

A shorter but similar event appears to have taken place in mid-June 2013. This period was not omitted from analysis, but events from this period were given more scrutiny.
A number of dates saw widespread anomalies that were harder to classify, but did not appear related to censorship. Anomalies that occurred on or around these dates were considered less likely to be due to censorship, and more likely to be part of the underlying event causing widespread anomalies. Until the manual review of anomalies for many projects had taken place and the cross-project nature of these events was uncovered, many of these events were considered possible blocking events. For instance, starting around November 9, 2013, articles across a wide number of projects saw distinct drops in traffic. Interestingly, and so far inexplicably, most (though not all) of these articles contained apostrophes in their titles:
English Wikipedia saw highly significant dropoffs like those illustrated above in at least 32 articles containing apostrophes in the title. One fact that often made this event appear related to censorship was that in many languages, the articles most likely to contain apostrophes in the title are often related to Islam:
Beginning around December 9, 2013 and peaking around December 12, 2013, more than 175 articles from 27 different projects saw significant and sustained decreases in traffic. Many of the articles were for single letters ("A," "P," "Ă," "Г"), and at least six projects saw significant drops for the article "Go":
While there were definitely patterns in the articles that saw significant drops, it proved difficult to attribute the pattern to any single cause.
October 16, 2014 was the peak of another set of anomalies. While this event took place for at least 14 Wikipedia projects, Japanese and Indonesian Wikipedias were the projects with the most affected articles. No clear relationship or patterns existed among the articles:

Appendix E: Article Analysis Methods In-Depth
As a method of inferring potential censorship motivation and providing context, we set out to identify articles that had been blocked in the past. In this idea's original iteration, we planned on detecting downward anomalies in the number of requests per day from each country to each Wikipedia article for as far back as the historical data would allow. The hypothesis was that fast, dramatic drops in the number of requests for an article do not occur as organic traffic patterns; rather, they must be the product of events that serve to move requests elsewhere or terminate the requests altogether. Our hope was that by locating these drops in traffic, we would have a heuristic for locating likely censorship events.
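The drop-finding heuristic described above can be illustrated with a short sketch. This is a deliberately simple stand-in and not the anomaly detection algorithm used in this study: it compares the mean number of daily requests in a window before each day with the mean in a window starting on that day, and reports the day with the sharpest relative decline. The function name `sharpest_drop`, the window length, the threshold, and the sample counts are all assumptions made for illustration.

```python
def sharpest_drop(counts, window=7, max_ratio=0.5):
    """Return (day_index, ratio) for the sharpest sustained drop, or None.

    ratio = mean of the `window` days starting at day_index divided by
    the mean of the `window` days before it. A result is returned only
    if the ratio is at or below `max_ratio` (hypothetical threshold).
    """
    best = None
    for t in range(window, len(counts) - window + 1):
        before = sum(counts[t - window:t]) / window
        after = sum(counts[t:t + window]) / window
        if before == 0:
            continue  # no baseline traffic to compare against
        ratio = after / before
        if best is None or ratio < best[1]:
            best = (t, ratio)
    if best is not None and best[1] <= max_ratio:
        return best
    return None


# Made-up daily request counts: steady traffic, then a sustained drop on day 10.
daily = [100, 104, 98, 101, 99, 103, 100, 97, 102, 100,
         20, 22, 19, 21, 18, 20, 23, 19, 21, 20]
result = sharpest_drop(daily)
print(result[0])  # index of the first day of the drop
```

A heuristic like this only surfaces candidate events. As the rest of this appendix describes, each candidate still needs manual review, since article moves, redirects, and bot activity can produce the same pattern.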
We started our analysis by creating a full list of all articles across all languages that could potentially be tested for anomalies. We first assembled a list of all Wikipedia projects, which resulted in 292 distinct projects. 179 For each of these projects, we downloaded the publicly available "Base per-page data" dumps from April 7, 2016. 180 These dumps were then inserted into a database. This process resulted in a dataset of 39,208,980 articles. There are currently 249 ISO-3166-1 country codes. 181 If we were to check every article from every one of these countries, we would need to analyze almost 10 billion time series. Even analyzing at a speed of 100 time series per second, it would take more than three years of computation time to check all article-country pairs. For this reason, combined with the presumption that there are far fewer articles that have been blocked than articles that have not, we chose to limit our analysis to a smaller set of articles.
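The scale estimate above can be checked with a few lines of arithmetic. The article count, the number of country codes, and the rate of 100 time series per second come from the text; the variable names and the 365.25-day year are assumptions of this sketch.

```python
ARTICLES = 39_208_980      # articles in the April 7, 2016 dumps (from the text)
COUNTRY_CODES = 249        # ISO-3166-1 country codes (from the text)
SERIES_PER_SECOND = 100    # assumed analysis speed (from the text)

# One time series per article-country pair.
pairs = ARTICLES * COUNTRY_CODES
seconds = pairs / SERIES_PER_SECOND
years = seconds / (365.25 * 24 * 3600)

print(f"{pairs:,} article-country time series")   # 9,763,036,020
print(f"{years:.2f} years of computation")        # 3.09
```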
There were a number of methods we could have used to generate the list of articles to be analyzed. The most obvious of these methods is a random sample, but given our assumption that the set of censored articles is much smaller than the set of uncensored articles, we wanted a way to oversample the set of likely censored articles. One method would be the manual curation of a list of articles deemed more likely to see censorship. This manual curation exercise is something the Berkman Klein Center is intimately familiar with. In the past, through the OpenNet Initiative (ONI) and related projects, the Berkman Klein Center has spent months collecting and categorizing such lists. Like this project, lists for these prior projects were meant to cover dozens of countries. To accomplish this, we primarily utilized a network of on-the-ground experts to develop country-specific lists. This was a large and slow endeavor. We therefore knew that the manual assembly of a sizable corpus of articles across all of Wikipedia's projects could become an intensive project in itself. We wished to avoid the large time and effort costs of our previous methods while still utilizing some of this previous work.
The lists of URLs we crafted for ONI were largely irrelevant to this project, but the categories of content that more often saw censorship were highly relevant. ONI categorized content into four broad categories: "Political," "Social," "Conflict/Security," and "Internet Tools." Internally, each of these categories contained a number of more specific topics (37 topics in all). For example, content in the Political category related to one of twelve topics (freedom of expression, women's rights, political reform, etc.) and the Social category contained nine topics (family planning, pornography, gambling, drugs, etc.). We had tweaked our taxonomy since the conclusion of ONI to contain 40 topics within the same four categories. Past ONI research indicated that each of the four broader categories saw pervasive censorship in at least one country. 182 Research on censored Wikipedia articles in the past has uncovered topics that were broadly similar to our own. 183 For these reasons, we decided that any method for constructing a set of articles must generate a set such that all four of our broad categories and a majority of the 40 topics within our taxonomy received analysis.
With the constraints that our sampled set must contain articles from a number of specific topics, that it must touch on most, if not all, of Wikipedia's projects, and that it must not become a sizable project unto itself, we designed an article selection method. The method we chose is as follows: we manually collected a list of "seed" articles that we deemed more likely to see censorship actions as the basis of our article set; we then added to this set all translations of these seed articles; we finally added all articles that were directly linked to by those already in our set. The use of a seed set would limit the manual curation we would need to perform, collecting translations would broaden coverage across Wikipedia projects, and link traversal would dramatically increase the number of articles in the set while still retaining some level of semantic relatedness.
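The three-step selection method above (seed articles, then their translations, then one hop of outgoing links) can be sketched as a small set-expansion routine. This is a minimal illustration rather than the code used in the study: `expand_seed_set`, the `(project, title)` tuple representation, and the two lookup callables are hypothetical, and the toy dictionaries stand in for whatever source would supply translations and links in practice.

```python
from typing import Callable, Iterable, Set, Tuple

Article = Tuple[str, str]  # (project language code, article title)


def expand_seed_set(
    seeds: Iterable[Article],
    get_translations: Callable[[Article], Iterable[Article]],
    get_links: Callable[[Article], Iterable[Article]],
) -> Set[Article]:
    """Seeds, plus their translations, plus articles they link to directly."""
    seed_set = set(seeds)
    # Step 2: add every translation of every seed article.
    with_translations = set(seed_set)
    for article in seed_set:
        with_translations.update(get_translations(article))
    # Step 3: add articles linked from seeds and translations (one hop only).
    expanded = set(with_translations)
    for article in with_translations:
        expanded.update(get_links(article))
    return expanded


# Toy data standing in for real translation and link lookups.
TRANSLATIONS = {("en", "Tor"): [("vi", "Tor")]}
LINKS = {
    ("en", "Tor"): [("en", "Onion routing")],
    ("vi", "Tor"): [("vi", "Mã hóa")],
    ("en", "Onion routing"): [("en", "Cryptography")],  # two hops away
}

articles = expand_seed_set(
    [("en", "Tor")],
    lambda a: TRANSLATIONS.get(a, []),
    lambda a: LINKS.get(a, []),
)
print(sorted(articles))
```

Because the traversal stops after one hop, "Cryptography" in the toy data is not included. Limiting the traversal this way keeps the set tied to the seed topics while still growing it substantially.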
The translation and link traversal steps were straightforward to implement, but the creation of the seed list still needed some level of manual curation. To develop a list of articles likely to see censorship, we researched lists of Wikipedia articles that have seen censorship in the past. There are only a few such lists: a China-centric list that GreatFire.org maintains and checks regularly for censorship, 184 a small Persian-centric list developed by Small Media, 185 and a larger Persian-centric list developed as part of a research project at the University of Pennsylvania. 186 Unfortunately, the public availability of the Persian-centric lists was discovered after much of our analysis was already complete, 187 so of the existing lists, we only used GreatFire.org's as a contributor to our seed set.
The GreatFire.org list contained coverage of many of the topics in our Internet Tools category ("Tor," "Facebook," "YouTube," "WeChat," etc.), but most of the other articles were related to matters specific to China ("Tibetan Buddhism," "Tiananmen Square," "Falun Gong," etc.). To increase the coverage of our target topics, we decided to add all articles included in English Wikipedia's list of controversial issues (and articles that included the "Controversial" template). 188 ("Controversial" in this sense refers to a high incidence of "edit wars," where edits to articles are repeatedly made and reverted.) The list of controversial issues included articles that covered 25 of our 40 targeted sensitive topics, with especially good coverage of social and political issues. Coupled with the GreatFire.org list, this met our goal of covering our four broad categories and more than three-quarters of the more specific topics within our taxonomy. Past research has also found a correlation between controversial Wikipedia articles and state censorship efforts, at least for the Iranian case. 189 The topic coverage of our article set and this past research gave us some confidence that our selected sample of articles would contain a higher proportion of censored articles than a purely random sample.
After assembling and cleaning the combined GreatFire.org and controversial articles lists, our seed set contained 2,933 articles. Fetching all translations of these articles expanded our set to 44,611 articles. Using the MediaWiki API 190 and the mwclient library, 191 we then added to our set all Wikipedia articles to which these 44,611 directly link. This resulted in a set of 1,722,543 articles. These 1.7 million articles became our top priority for data collection and analysis. This set included articles from 286 distinct Wikipedia projects (out of the total 292), and 132 projects were represented by more than 10,000 articles.
We then attempted to locate data on the number of requests per day for each of these articles from every country. We had a particular date range in mind when looking for this data. If Wikipedia could identify from their own data the articles that were likely censored, they might be able to infer motivations of the censors. When Wikipedia moved to providing content solely by HTTPS in June 2015, censors likely lost the ability to discriminate between articles they wanted to censor and articles they did not. This meant the change to HTTPS-only content delivery likely caused Wikipedia to correspondingly lose their window into some of the intentions of the various censoring bodies. Because of this possibility, we chose to look most closely at article-specific censorship prior to the June 2015 change to HTTPS.
For privacy reasons, Wikimedia does not publicly release request data separated out by both article and country, so we were granted research access to one of Wikimedia's internal research databases under a non-disclosure agreement. Upon entering the database, we discovered that data of this kind was only available from May 10, 2015 onward. We hypothesized that the transition Wikipedia made to HTTPS-only for all its projects in mid-June 2015 would eliminate much of the article-level censorship, and that therefore, we would only have been left with useful historical data from mid-May 2015 to mid-June 2015. As our workflow was designed primarily to locate the beginning of censorship events, it was deemed that the effort and time required to look for the beginning of censorship events in the four week window from mid-May to mid-June 2015 would not have been effort well spent.
The time and effort required to extract data from Wikimedia's internal research database played a part in this calculation. A query to extract one year's worth of daily request counts for a single article took approximately ten minutes. This query time would have been less of an issue if we had extracted data for many articles at once, as querying for multiple articles in the same query did not significantly increase the response time, but we then would have faced the issue of managing and querying against a large quantity of exported data on infrastructure we had little ability to control. We did explore using scratch space within the same database infrastructure to manage this exported data, but the database software introduced significant time overhead in even simple queries against this relatively small dataset that would have created a large bottleneck in our analysis pipeline.
Instead of investing significant work for four weeks of data, we shifted our focus to sources with more historical coverage. We identified four publicly available sources of historical article-level data: the Wikimedia Pageview API, the http://stats.grok.se site, the Wikipedia "pagecounts-raw" dumps, and the Wikipedia "pagecounts-ez" dumps. Because all these sources are meant for public consumption, much of the granularity had been removed prior to publication to protect user privacy. 192 That is most notable for this project because requests were no longer broken out by both article and the geographic location of the request. This meant that any analysis performed on the data could not connect censorship events directly to the countries within which the censorship was likely taking place. This was an unfortunate concession that needed to be made in order for our analysis to continue. With that decision made, we continued to evaluate the utility of the various data sources.
The Wikimedia Pageview API was quickly eliminated because its historical data starts July 1, 2015, which is after the transition to HTTPS. As we knew we only wanted data on 1.7 million of approximately 40 million articles, and that we wanted this data daily rather than hourly, we chose to use the http://stats.grok.se API. This was convenient because we wouldn't need to download data for articles we were not interested in, and stats.grok.se had already aggregated requests by day whereas the dumps provided hourly data. stats.grok.se also provided the benefit of historical data back to December 2007.

Unfortunately, it quickly became clear that fetching data from the stats.grok.se API at the volume we needed was much too slow. We turned our attention to the pagecount data dumps. We determined that the "pagecounts-raw" data was larger than we could handle with our storage infrastructure, which left the "pagecounts-ez" dumps as the most suitable source of data. This data came with a cost: instead of historical data back to December 2007, which both stats.grok.se and the "pagecounts-raw" dumps provided, the "pagecounts-ez" data only existed from November 2011 onward. This meant that we would not have data on censorship events that might have occurred in the period between December 2007 and November 2011. Again, this concession was necessary for our analysis to continue.
We downloaded the "pagecounts-ez" data (through the your.org mirror, 193 which provided substantially greater speeds), pulled out data on the 1.7 million articles we had selected earlier, and reaggregated the number of requests by day rather than by hour.
Now that we had data to analyze, we began the anomaly detection process. The intent of this analysis was to locate possible article censorship events. We used anomaly detection not as a statistical tool, but rather as a search heuristic to find events worth investigating. The core of this process was the Robust Principal Component Analysis (RPCA) anomaly detection algorithm. 194 This algorithm was chosen because it is moderately fast, works well across various types of time series, provides feedback on how anomalous each data point is, has open source implementations in multiple languages, 195 and was being used in production at Netflix. 196
When initiating this project, we designed our analysis pipeline around the constraint that much of the data could not leave Wikimedia's servers, and that therefore much of the analysis itself would need to take place on Wikimedia's servers. We therefore chose a pipeline architecture that was easy to deploy in an environment over which we had little control. Many of the requirements of this pipeline were fulfilled by Mozilla's Heka project. 197 Heka was attractive because it is written in Go, which meant we could simply compile a dependency-less binary, copy it to Wikimedia's servers, and feed it Wikimedia's data. For this to happen, we needed a version of the RPCA algorithm that we could compile into Heka. That necessitated porting the RPCA algorithm to Go, which we did. 198 We further customized the Heka pipeline by adding the ability to aggregate data at different timescales.
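As an illustration of the hourly-to-daily reaggregation step, the following is a minimal Python sketch rather than the pipeline we ran (which was built on Heka, in Go). The tuple layout (project, article, hour, count) and the function name aggregate_daily are assumptions made for the example and do not reflect the actual "pagecounts-ez" file format.

```python
from collections import defaultdict
from datetime import datetime


def aggregate_daily(hourly_rows):
    """Sum hourly request counts into per-article daily totals.

    hourly_rows: iterable of (project, article, hour, count) tuples, where
    hour is a datetime. This layout is hypothetical, chosen for illustration.
    """
    daily = defaultdict(int)
    for project, article, hour, count in hourly_rows:
        # Dropping the time of day collapses all 24 hourly rows into one key.
        daily[(project, article, hour.date())] += count
    return dict(daily)


rows = [
    ("en", "Example", datetime(2014, 5, 1, 0), 12),
    ("en", "Example", datetime(2014, 5, 1, 23), 8),
    ("en", "Example", datetime(2014, 5, 2, 0), 5),
]
daily = aggregate_daily(rows)
```

The same pattern extends to other timescales by replacing hour.date() with a coarser key, such as the ISO week or the month.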
We also created two modules: one that grouped consecutive anomalous measurements together into multi-day anomalous events and computed a score for each event based on an aggregation of each constituent day's anomalousness and the total duration of the anomaly, and one to output anomalies. With everything in place, we ran each of our 1.7 million time series through the anomaly detection pipeline and output all the resulting anomalous events to an Elasticsearch index. We ultimately ended up with about 92.4 million anomalous events, 18 million of which had scores below zero, indicating that the observed number of requests was lower than what the RPCA algorithm would have expected given the article's history.
We then began looking through the events. Our first observation was that many of the most extreme anomalous events were short events that occurred at the same time across all or almost all articles. These events are outlined in Appendix D. We surmised that these were likely data collection issues on Wikimedia's side rather than actual Wikipedia-wide request drop-offs. While Wikimedia does document many of their data collection issues, 201 we were unable to locate issue documentation going back far enough in time to confirm our suspicions. These events were excluded from further analysis.
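The grouping of consecutive anomalous days into multi-day events can be sketched as follows. This is a minimal Python illustration, not the Heka module itself. The function name group_events and the input layout are hypothetical, and the event score used here (a plain sum of the daily scores, which grows in magnitude with duration) is an assumption: the exact aggregation is not specified above.

```python
from datetime import date, timedelta


def group_events(anomalous_days):
    """Merge runs of consecutive anomalous days into events.

    anomalous_days: iterable of (day, score) pairs for days already flagged
    as anomalous; negative scores mean fewer requests than expected.
    """
    events = []
    for day, score in sorted(anomalous_days):
        if events and day - events[-1]["end"] == timedelta(days=1):
            # Extends the current run: lengthen the event, accumulate score.
            events[-1]["end"] = day
            events[-1]["score"] += score  # assumed aggregation: simple sum
        else:
            events.append({"start": day, "end": day, "score": score})
    for ev in events:
        ev["days"] = (ev["end"] - ev["start"]).days + 1
    return events


flagged = [
    (date(2014, 5, 1), -2.0),
    (date(2014, 5, 2), -3.5),
    (date(2014, 5, 3), -1.5),
    (date(2014, 5, 10), 4.0),
]
events = group_events(flagged)
```

Here the first three days merge into one three-day event, while the isolated day on May 10 forms its own single-day event.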
Once those events were excluded, we generated graphs of the 500 anomalies with the lowest scores per Wikipedia project. We observed that many of the top anomalies were for articles with very little request traffic, and it appeared as though the algorithm was therefore picking up any traffic to these articles as anomalous. These results were undesirable, so for all articles that contained at least one anomaly with a score less than zero, we computed the median number of requests per day. We then regenerated our graphs for the 500 lowest scoring anomalies per project, but only considered articles that had a median of ten or more requests per day. Thirty-seven projects contained zero articles that fit that description, 118 projects contained more than zero but fewer than 500, and 137 projects contained 500 or more. Altogether we graphed 95,603 anomalies. To ensure we did not miss any notable events in large projects, we also graphed the 5,000 anomalies with the lowest scores across all languages among articles with a median of more than 100 requests per day. In total we selected and graphed 100,603 anomalies.
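The selection step above (drop low-traffic articles, then keep the lowest-scoring anomalies per project) can be sketched in Python as follows. The record layout, the function name select_anomalies, and the in-memory dictionaries are assumptions for illustration; in our pipeline the events lived in an Elasticsearch index.

```python
import heapq
from collections import defaultdict
from statistics import median


def select_anomalies(events, daily_counts, min_median=10, per_project=500):
    """Pick the lowest-scoring negative anomalies per project.

    events: list of dicts with "project", "article", and "score" keys.
    daily_counts: maps (project, article) to a list of daily request counts.
    Both layouts are hypothetical.
    """
    by_project = defaultdict(list)
    for ev in events:
        if ev["score"] >= 0:
            continue  # only drops in traffic are of interest
        counts = daily_counts[(ev["project"], ev["article"])]
        if median(counts) < min_median:
            continue  # low-traffic article: any request looks anomalous
        by_project[ev["project"]].append(ev)
    return {
        project: heapq.nsmallest(per_project, evs, key=lambda e: e["score"])
        for project, evs in by_project.items()
    }
```

A project with fewer qualifying anomalies than per_project simply returns all of them, which matches the case of projects that contained more than zero but fewer than 500.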
We then manually reviewed these anomaly graphs. This process was meant to familiarize us with the kinds and categories of anomalies we might be detecting, and how the shapes of these anomalies might indicate different phenomena. We investigated many anomalies for each of the various types of shapes that we saw. We could only find strong evidence of the processes backing two kinds of anomalies: national holidays and article editing events internal to Wikipedia. National holidays often see articles drop off quite dramatically (but never instantly), and they often recover just as quickly. In graphical form, they often look like sharp letter V's. Article editing events like moves, deletions, and redirects were behind a large number of detected anomalies in the number of requests. The detected changes could be negative or positive depending on the nature of the change. When searching through the detected events for anomalies that might constitute blocking, we had to consider the fact that editing events could look very similar to our hypothesized censorship events.
We hypothesized that a censorship event would look like a relatively stable number of requests followed by an instant drop that stays stable for some period of time. If the article were to become unblocked, we also hypothesized that the number of requests would more slowly increase back to somewhere near pre-censorship levels. While reviewing the graphs, we paid special attention to graphs with this shape. We pulled many of these out for further investigation, along with other anomalies that did not look like they could be organic traffic patterns. We often chose not to select anomalies that occurred around national holidays and had the deep V shape we had previously associated with holidays. Altogether, we pulled out 1,288 anomalies.
We then grouped anomalies that started on the same date and had similar shapes. We sorted each group, and each anomaly that did not fall into a group, into either high or low investigatory priority based on a number of judgements. First, we attempted to find anomalies that indicated a fast and severe drop in traffic that lasted for more than a single day. We also hypothesized that where traffic dropped completely to zero, these were likely data collection errors, and so we focused on anomalies where traffic dropped significantly but not completely. We attempted to find anomalies that occurred together in time, as we believed censors are likely to block more than a single article at a time. For anomalies that occurred together in time, we wanted to make sure they were limited to only a small number of languages and did not occur widely across all of Wikipedia's projects, because we believed censors were more likely to target only the languages spoken within their country and that more widespread anomalies were more likely to be data collection issues.
For the anomalies that remained, we considered them more likely to be related to censorship if the articles were thematically related, as synchronized anomalies for apparently unrelated articles could be caused by a larger number of processes (bots ceasing operations, the blocking of high-volume requesters, network outages, etc.). Once this review process was complete, we were left with 441 high priority anomalies and 847 lower priority anomalies. We then further investigated the high priority anomalies and only looked at lower priority anomalies that related to our countries of interest.
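The shape criteria we applied by eye during this review (a stable baseline, then an abrupt drop that lasts more than a single day and does not reach zero) can be approximated in code. The Python sketch below is only an illustration of that heuristic: our review was manual, and the function name looks_like_block, the seven-day window, the 50 percent drop threshold, and the two-day minimum are all assumed values, not parameters we used.

```python
from statistics import median


def looks_like_block(counts, drop_day, window=7, min_drop=0.5, min_days=2):
    """Return True if daily counts show a sustained, incomplete drop.

    counts: list of daily request counts for one article.
    drop_day: index of the first day of the suspected drop.
    All thresholds are illustrative assumptions.
    """
    before = counts[max(0, drop_day - window):drop_day]
    after = counts[drop_day:drop_day + window]
    if len(before) < window or len(after) < min_days:
        return False  # not enough history on either side to judge
    baseline = median(before)
    if baseline <= 0:
        return False
    threshold = baseline * (1 - min_drop)
    # Sustained: every one of the first min_days days is below the threshold,
    # which filters out one-day dips such as the V shape seen on holidays.
    sustained = all(c <= threshold for c in after[:min_days])
    # Traffic that falls all the way to zero is treated as a likely data
    # collection error rather than as blocking.
    incomplete = median(after) > 0
    return sustained and incomplete
```

For example, a series that holds at 100 requests per day and then falls to 30 for a week satisfies the heuristic, while one that falls to zero, or that dips for a single day and recovers, does not.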
For the high priority anomalies, further investigation consisted of a number of steps. First, for each anomaly or group of anomalies, we queried and graphed the twenty anomalies with the largest drop-offs starting on the same day for the same language or languages. We then reviewed these new anomalies for any that might fit the pattern of the high priority anomaly. Any that fit were added to the anomaly group. Then, for each anomaly or group of anomalies, we spot-checked the histories of a number of articles to identify any events that might have been the cause of the anomalies. We checked both the edit history page of the article itself and the public log of the appropriate language. 202 If an editing event that could cause a significant decrease in traffic (page move, deletion) took place at the same time as a detected anomaly, we assumed this event was the cause of the anomaly and ruled it out as a potential censorship event. This step eliminated many of the top anomalies. Our next step was to confirm that the anomalies were not taking place on national holidays. For this step, we relied heavily on timeanddate.com, which maintains a list of dates of historical holidays for many countries around the world. 203 For the anomalies that remained, if their article titles were not in English, we translated them using Google Translate 204 as a first pass and then located the English translation of the article in question for those translations that we considered suspect. Once we had translations, we took special care investigating anomalies in articles that looked to be thematically similar. The results of these investigations constitute the bulk of the article-level results presented above. As the process above illustrates, we chose to be conservative rather than comprehensive in our results.
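The two rule-out checks described above (a coinciding editing event, then a coinciding national holiday) amount to a simple date comparison. The Python sketch below illustrates that logic only: we performed these checks by hand against article histories, public logs, and holiday listings. The function name rule_out_reason and the one-day tolerance are assumptions, as the text above says only that the events took place "at the same time."

```python
from datetime import date, timedelta


def rule_out_reason(anomaly_start, edit_event_dates, holiday_dates,
                    tolerance_days=1):
    """Return why an anomaly is ruled out, or None if it stays a candidate.

    edit_event_dates: dates of page moves or deletions for the article.
    holiday_dates: national holiday dates for the relevant country.
    tolerance_days is an assumed matching window.
    """
    window = timedelta(days=tolerance_days)
    if any(abs(anomaly_start - d) <= window for d in edit_event_dates):
        return "editing event"
    if any(abs(anomaly_start - d) <= window for d in holiday_dates):
        return "national holiday"
    return None
```

An anomaly for which the function returns None has passed both checks and remains a candidate for the translation and thematic review that follow.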