ICLab Data

ICLab has been running and measuring Internet censorship since late 2016. We are happy to share the analyzed data used in our recent paper, "ICLab: A Global, Longitudinal Internet Censorship Measurement Platform," accepted to the IEEE Symposium on Security and Privacy 2020.

Our data is hosted on several platforms for public access. Please contact us if you encounter any issue when downloading the data.

Our new public data format is UTF-8 encoded CSV text files. Fields are separated by commas and quoted with ". Caution: some fields can contain commas; use a true CSV parser rather than splitting lines on /,/. The columns are:

  1. filename - Name of the raw data file (for internal use)
  2. server_t - Date and time when the measurement was conducted, in ISO 8601 format (e.g., 2017-01-01T00:03:55.797+00:00)
  3. country - Country where the measurement was conducted, as an ISO 3166-1 alpha-2 country code
  4. as_number - Autonomous System number of the network from which the measurement was conducted
  5. schedule_name - Internal label for the list of URLs tested in this measurement (e.g., alexa-global, citizenlab-global)
  6. url - URL on the test list
  7. domain - Domain name of the URL on the test list
  8. final_url - URL reached by the client after following HTTP redirections
  9. sanitized - Whether the location of the VPN is sanitized.
  10. dns - Outcome of the DNS tampering analysis: one of the codewords normal, tampered, or uncertain.
  11. dns_reason - Details of the DNS tampering analysis
  12. field_answers - DNS responses from the field measurement.
  13. field_nameserver - The nameserver that the DNS query was sent to in the field measurement.
  14. control_answers - DNS responses from the control measurement.
  15. control_nameserver - The nameserver that the DNS query was sent to in the control measurement.
  16. http_status - HTTP status code for the final page load in the redirection chain (e.g., 200, 451)
  17. blockpage_reason - none if no blockpage was detected, or an internal identifier for the regular expression that matches this type of blockpage
  18. packet_injection - Outcome of the packet injection analysis: one of the codewords censored, missing data, none, not censored, probably censored, or uncertain.
  19. packet_field_category - Classification of any anomalous packets observed during this measurement.
  20. ports - Port(s) on which the packet anomaly was observed.
  21. packet_control_category - Classification of any anomalous packets observed during a matching measurement of this site from a control node.
  22. censored - Final assessment of this measurement: true if censored, false if not censored.
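
As a minimal sketch of reading this format with a true CSV parser, the Python example below uses the standard csv module. The column order is taken from the list above; everything else is an assumption made for illustration: the function name read_new_format, the has_header switch (we do not state here whether the files begin with a header row), and the sample row, whose values are made up and do not come from the dataset.

```python
import csv
import io
from datetime import datetime

# Column order as documented in the list above.
NEW_COLUMNS = [
    "filename", "server_t", "country", "as_number", "schedule_name",
    "url", "domain", "final_url", "sanitized", "dns", "dns_reason",
    "field_answers", "field_nameserver", "control_answers",
    "control_nameserver", "http_status", "blockpage_reason",
    "packet_injection", "packet_field_category", "ports",
    "packet_control_category", "censored",
]


def read_new_format(fileobj, has_header=False):
    """Yield one dict per row (hypothetical helper, not part of ICLab)."""
    reader = csv.reader(fileobj)
    if has_header:
        next(reader, None)  # skip a header row, if the file has one
    for row in reader:
        if len(row) != len(NEW_COLUMNS):
            raise ValueError(
                f"expected {len(NEW_COLUMNS)} fields, got {len(row)}"
            )
        record = dict(zip(NEW_COLUMNS, row))
        # server_t is ISO 8601 with an explicit UTC offset.
        record["server_t"] = datetime.fromisoformat(record["server_t"])
        yield record


# Made-up sample row; two fields contain commas on purpose.
sample_row = [
    "example.csv", "2017-01-01T00:03:55.797+00:00", "XA", "64496",
    "alexa-global", "http://example.com/", "example.com",
    "http://example.com/", "false", "normal", "",
    "192.0.2.1, 192.0.2.2", "198.51.100.1",
    "192.0.2.1, 192.0.2.2", "198.51.100.2",
    "200", "none", "none", "", "", "", "false",
]
buffer = io.StringIO()
csv.writer(buffer, quoting=csv.QUOTE_ALL).writerow(sample_row)
buffer.seek(0)

records = list(read_new_format(buffer))
print(records[0]["field_answers"])
print(records[0]["server_t"].isoformat())
```

Splitting the same line on commas would yield more than 22 pieces, because the quoted DNS answer fields in the sample contain commas themselves.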

Caution: Our block-page detection regexps are known to trigger on some sites that refuse access to clients from specific countries or when they detect use of a VPN, as well as on block pages actually injected by a censor in an intermediate network.
It is debatable whether refusal of access by a site for these reasons should be considered censorship; we are currently counting them as such in the censored column and our summary statistics.
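
Readers who prefer to treat such cases separately can split the censored rows by whether a block-page regexp matched. The Python sketch below shows one way to do that. It assumes each row has already been parsed into a dict keyed by column name, and that the censored and blockpage_reason columns hold the literal strings described above (true/false and none); the helper name summarize_censored and the sample rows are hypothetical.

```python
from collections import Counter


def summarize_censored(records):
    """Count censored rows per country, and how many of those had a
    block-page regexp match (hypothetical helper for illustration)."""
    total = Counter()
    blockpage_matched = Counter()
    for rec in records:
        if rec["censored"].strip().lower() != "true":
            continue
        total[rec["country"]] += 1
        # "none" means no block page was detected for this row.
        if rec["blockpage_reason"].strip().lower() != "none":
            blockpage_matched[rec["country"]] += 1
    return total, blockpage_matched


# Made-up rows; XA and XB are placeholder country codes.
sample_records = [
    {"country": "XA", "censored": "true", "blockpage_reason": "none"},
    {"country": "XA", "censored": "true", "blockpage_reason": "hypothetical_pattern"},
    {"country": "XA", "censored": "false", "blockpage_reason": "none"},
    {"country": "XB", "censored": "true", "blockpage_reason": "hypothetical_pattern"},
]

total, blockpage_matched = summarize_censored(sample_records)
for country in sorted(total):
    print(country, total[country], blockpage_matched[country])
```

The second count only marks rows to which the caveat may apply; it does not say whether a given block page came from the site itself or from a censor.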

Our older data (prior to 2020) is in CSV format with the following columns:

  1. filename: name of raw data file (for internal use)
  2. server_t: the timestamp of when the measurement was conducted (e.g., 2017-01-01T00:03:55.797Z)
  3. country: ISO 3166-1 alpha-2 country code
  4. as_number: Autonomous System Number
  5. schedule_name: web test list (i.e., Alexa global top list, CitizenLab, or Berkman Center)
  6. url
  7. dns
  8. dns_reason: true = manipulated, false = unmanipulated
  9. dns_all
  10. dns_reason_all
  11. http_status
  12. block: true = blockpages, false = normal
  13. body_len
  14. http_reason
  15. packet_updated: true = injected, false = no injection
  16. packet_reason
  17. censored_updated: true = censored, false = uncensored
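
A row in the older format can be handled in a similar way. The Python sketch below maps the 17 columns listed above onto a dict, parses the Z-suffixed timestamp, and converts three of the true/false columns to booleans. The helper name parse_old_row, the assumption that booleans appear as the literal strings true and false, and the sample row are all illustrative and not taken from the dataset.

```python
import csv
import io
from datetime import datetime

# Column order as documented in the list above.
OLD_COLUMNS = [
    "filename", "server_t", "country", "as_number", "schedule_name",
    "url", "dns", "dns_reason", "dns_all", "dns_reason_all",
    "http_status", "block", "body_len", "http_reason",
    "packet_updated", "packet_reason", "censored_updated",
]

# Columns converted to bool here (assumed to hold "true"/"false").
BOOLEAN_COLUMNS = ("block", "packet_updated", "censored_updated")


def parse_old_row(row):
    """Turn one parsed CSV row into a dict (hypothetical helper)."""
    if len(row) != len(OLD_COLUMNS):
        raise ValueError(f"expected {len(OLD_COLUMNS)} fields, got {len(row)}")
    record = dict(zip(OLD_COLUMNS, row))
    stamp = record["server_t"]
    # datetime.fromisoformat rejects a trailing "Z" before Python 3.11,
    # so rewrite it as an explicit UTC offset.
    if stamp.endswith("Z"):
        stamp = stamp[:-1] + "+00:00"
    record["server_t"] = datetime.fromisoformat(stamp)
    for name in BOOLEAN_COLUMNS:
        record[name] = record[name].strip().lower() == "true"
    return record


# Made-up sample line; XA is a placeholder country code.
sample_line = (
    '"example.csv","2017-01-01T00:03:55.797Z","XA","64496","alexa-global",'
    '"http://example.com/","","","","","200","false","1256","",'
    '"false","","false"\r\n'
)
old_record = parse_old_row(next(csv.reader(io.StringIO(sample_line))))
print(old_record["server_t"].isoformat(), old_record["censored_updated"])
```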
by Calipr Networking Group