Orwell

28th June 2021

The ever-changing content and location of resources on the Internet pose a challenge for collecting clues and evidence. For instance, an incriminating social network post available today is seldom available tomorrow. The situation is even more complicated with overlay networks such as Tor (and their dark marketplaces, websites hosting child abuse material, etc.).
When processing web content programmatically, you must take into account that:

  • parsing information with regular expressions is no longer feasible when the content is rendered dynamically with JavaScript;
  • client-side single-page applications need a completely different approach to decoding and archiving;
  • rate limiting, captchas, and authentication are employed on most webpages as protection against crawling from link to link.
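
The first point can be illustrated with a minimal sketch. Both HTML strings, the `price` class name, and the `extract_prices` helper are hypothetical and exist only for this example; they are not part of ORWELL.

```python
import re

# Hypothetical pattern for a price element in the page markup.
PRICE_PATTERN = re.compile(r'<span class="price">([^<]+)</span>')

# Hypothetical server response for a statically rendered page.
STATIC_HTML = '<html><body><span class="price">0.05 BTC</span></body></html>'

# Hypothetical server response for a single-page application: the price is
# inserted into the DOM only after the browser executes app.js.
SPA_HTML = (
    '<html><body><div id="root"></div>'
    '<script src="app.js"></script></body></html>'
)


def extract_prices(html: str) -> list[str]:
    """Return every price found in the raw HTML by the regular expression."""
    return PRICE_PATTERN.findall(html)


if __name__ == "__main__":
    print(extract_prices(STATIC_HTML))  # the price is present in the markup
    print(extract_prices(SPA_HTML))     # nothing to match before JavaScript runs
```

The regular expression works on the static page but finds nothing in the single-page application shell, because the data is not yet in the markup the server returns.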

Usage

The ORWELL framework allows you to:

  1. decode and parse the webpage programmatically for both human-readable and machine content;
  2. crawl and discover new links and resources;
  3. archive obtained data in a transferable form guaranteeing integrity.

ORWELL offers a versatile toolbox for working with web content. Typical ORWELL use cases include:

  • You have a ransomware note hosted on an autogenerated domain name, and you need to create a snapshot of it.
  • An anti-government social network post receives a lot of attention; you are tasked with collecting and analyzing the responses before the post disappears.
  • You are running a long-term investigation of a dark marketplace, and you need to periodically collect metadata about its operation (e.g., listed items and their prices, seller’s details).
  • You have an unknown cryptocurrency address, and you would like to check whether it appears on relevant webpages by using their search engines.

Features

Many HTTP servers defend against web crawlers by limiting the number of requests per visiting IP address. ORWELL bypasses this by load-balancing its traffic across a set of user-defined proxies.
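
As a rough illustration of the idea, the sketch below spreads requests over proxies in round-robin order. The source does not say which balancing scheme ORWELL uses, so round-robin is an assumption, and the function names and proxy addresses are hypothetical.

```python
from itertools import cycle
from typing import Iterator


def proxy_rotation(proxies: list[str]) -> Iterator[str]:
    """Return an endless round-robin iterator over the given proxies."""
    if not proxies:
        raise ValueError("at least one proxy is required")
    return cycle(proxies)


def assign_proxies(urls: list[str], proxies: list[str]) -> list[tuple[str, str]]:
    """Pair each URL with the next proxy in the rotation."""
    rotation = proxy_rotation(proxies)
    return [(url, next(rotation)) for url in urls]


if __name__ == "__main__":
    # Hypothetical proxy addresses from the documentation address range.
    proxies = ["http://192.0.2.10:8080", "http://192.0.2.11:8080"]
    urls = ["https://example.org/a", "https://example.org/b", "https://example.org/c"]
    for url, proxy in assign_proxies(urls, proxies):
        print(f"{url} via {proxy}")
```

With N proxies, each one carries roughly 1/N of the requests, so the per-IP request count seen by the target server drops accordingly.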

ORWELL integrates with anti-captcha solutions (both manual and automatic). Moreover, other ORWELL hooks can be used to pass authentication and authorization credentials (such as a username and password), allowing ORWELL to browse and access private pages.

ORWELL stores data in the Mozilla Archive Format, which offers a more accurate visualization of embedded HTML resources (e.g., pictures, animations, CSS, JavaScript) compared to plain HTML and MHTML. Apart from the decoded webpage, the ORWELL archive contains two additional tabs:

  • the first with a screenshot of the webpage as seen by a browser;
  • the second with metadata such as the IP address, timestamp, integrity seal, and parsed data.
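
The metadata record can be pictured with the following sketch. The source does not describe how ORWELL computes its integrity seal, so the SHA-256 digest, the field names, and both helper functions are assumptions made for illustration.

```python
import hashlib
from datetime import datetime, timezone


def build_metadata(content: bytes, ip_address: str, captured_at: datetime) -> dict:
    """Build a metadata record with a SHA-256 digest acting as the seal."""
    return {
        "ip_address": ip_address,
        "timestamp": captured_at.astimezone(timezone.utc).isoformat(),
        "sha256": hashlib.sha256(content).hexdigest(),
    }


def verify(content: bytes, metadata: dict) -> bool:
    """Check that the content still matches the digest stored in the record."""
    return hashlib.sha256(content).hexdigest() == metadata["sha256"]


if __name__ == "__main__":
    page = b"<html><body>archived page</body></html>"
    meta = build_metadata(page, "192.0.2.1", datetime.now(timezone.utc))
    print(meta)
    print(verify(page, meta))                # unchanged content
    print(verify(page + b"tampered", meta))  # modified content
```

A plain digest only detects modification of the content. Proving who created the archive and when would additionally require a signature or a trusted timestamp.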
 
ORWELL governs crawling through webpages using different strategies (e.g., breadth-first search, embedded search fields).
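
A breadth-first strategy can be sketched as follows. This is a generic illustration under stated assumptions, not ORWELL's implementation: `bfs_crawl`, the `get_links` callback, and the in-memory link graph are hypothetical stand-ins for real page fetching and link extraction.

```python
from collections import deque
from typing import Callable, Iterable


def bfs_crawl(
    start: str,
    get_links: Callable[[str], Iterable[str]],
    max_pages: int = 100,
) -> list[str]:
    """Visit pages level by level from `start`; return them in visit order."""
    order: list[str] = []
    seen = {start}          # pages already queued, to avoid revisiting
    queue = deque([start])  # FIFO queue gives breadth-first order
    while queue and len(order) < max_pages:
        page = queue.popleft()
        order.append(page)
        for link in get_links(page):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order


if __name__ == "__main__":
    # Hypothetical link graph standing in for fetched pages.
    graph = {"/": ["/a", "/b"], "/a": ["/c", "/"], "/b": ["/c"], "/c": []}
    print(bfs_crawl("/", lambda page: graph.get(page, [])))
```

Pages closest to the starting URL are visited first, and the `max_pages` limit bounds the crawl on sites with a very large or cyclic link structure.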