project old tech new ideas Mar 21, 2016

How Many Websites Provide RSS / Web Syndication Feeds

I became interested in web syndication feeds recently. It's an old technology nobody talks about anymore, but everybody still provides them. Taking a look at Google Trends shows us how much people have lost interest in RSS since its peak popularity in 2006:

It's an old technology people seem to have lost interest in, but can we still rely on it? Are people abandoning web syndication? With the advent of CMS (eg. the huge popularity of WordPress), static website generators and publishing platforms (eg. Medium) which all provide syndication feeds by default, RSS doesn't look dead to me.

This is how I became interested in finding how many websites actually do provide at least one XML syndication feed. Builtwith (opens new window) has a pretty nifty "trends" section and it states that 33% of the 1 million most visited websites have an RSS feed. While this gives us a pretty good idea, let's see what we could do by ourselves.

We basically have two options here:

Actually crawling the web - a costly and lengthy process
Relying on existing web crawl data

Enters CommonCrawl. CommonCrawl is a non-profit founded on the exciting project of crawling tons of web pages and releasing the obtained dataset publicly and for free. Their latest dump was published in November 2015 (opens new window) and contains 1.82 billion web pages, amounting to over 151TB of highly compressed HTML. I half-jokingly said this before: why not mining CommonCrawl to answer my own question?

A few weeks ago two colleagues of mine mentioned over lunch their desire to mine Common Crawl for their research, and it was a coincidence that I had been thinking about doing the same (though mostly for fun) for a few months. Wouldn't have we been able to combine our efforts into a single "Common Crawl run", I most probably wouldn't have mined this dataset all by myself only to satisfy my curiosity.

Working with CommonCrawl

As mentioned before, the latest Common Crawl is 151TB of data hosted on S3. We decided to process it directly from a few EC2 spot instances using the WDC Extraction Framework (opens new window). Actually this choice was pretty obvious to us because the syndication feeds are not the only thing we wanted to extract from CommonCrawl. My colleagues were interested in extracting all structured data (opens new window) embedded in any HTML page which is exactly what the WDC framework was designed for. We also wanted to extract all HTML anchors pointing to a Wikipedia page. We forked the WDC framework (opens new window) and modified it to extract these different things all at once, thus only needing to go through the CommonCrawl dataset once.

Now, CommonCrawl dump consists of HTML web pages and we want to extract parts of it, eg. any <link rel="alternate" href="/blog/atom.xml" title="something" type="application/atom+xml"> with either RSS/Atom type attribute or rel="alternate" attribute. Although parsing HTML with regex is a bad idea (opens new window), constructing a full-blown DOM tree out of each of these 1.82B pages would require far too much time and processing power.

Here's the regular expression I used to match feeds:

(<link[^>]*(?:\s(?:type=[\"']?(application\/rss\+xml|application\/atom\+xml|application\/rss|application\/atom|application\/rdf\+xml|application\/rdf|text\/rss\+xml|text\/atom\+xml|text\/rss|text\/atom|text\/rdf\+xml|text\/rdf|text\/xml|application\/xml)[\"']?|rel=[\"']?(?:alternate)[\"']?))[^>]*>)"

The WDC framework wrote its results to our S3 bucket in csv.gz format. For the whole run we budgeted 500USD for AWS EC2 instances. We spawned 100 c3.4xlarge (opens new window) EC2 instances (16 cores each) to run our modified WDC framework. Using spot instances saved us some money. It took around 30 hours to process the whole Common Crawl dump and costed less than 450USD, we were right on target.

Most interesting is of course post-processing the data, not extracting it. Here is what our results S3 bucket looks like together with a short explanation of what the result of our various extractions are:

2.3G    /WDC_112015/anchors
60G     /WDC_112015/feeds
54G     /WDC_112015/urls
346G    /WDC_112015/data
408G    /WDC_112015/anchor_pages
37G     /WDC_112015/stats
905G    total

1
2
3
4
5
6
7

(gzipped TSV) anchors: page_url | anchor text | wikipedia_url
(gzipped TSV) feeds: page_url | link_tag | link_type
(gzipped TSV) urls: page_url
(gzipped text) data: <quintuplet>\n (subject, predicate, object, page, extractor_used)
(gzipped json) anchor_pages: {"url": page_url, "content": full_html_page}\n
(gzipped TSV) stats: arcFileName | arcFilePos | detectedMimeType | hostIp | html-head-meta | html-mf-adr | html-mf-geo | html-mf-hcalendar | html-mf-hcard | html-mf-hlisting | html-mf-hrecipe | html-mf-hresume | html-mf-hreview | html-mf-species | html-mf-xfn | html-microdata | html-rdfa | html-rdfa11 | mimeType | recordLength | referencedData | timestamp | totalTriples | uri

First step was of course to sync this bucket locally and backup it to our NAS. Once this done and after having triggered security warnings at the network admins office for downloading almost 1TB at full speed, we proceeded to copy this data to DAPLAB (opens new window) hadoop cluster. DAPLAB is an awesome project aiming at providing a powerful data processing cluster on a freemium and premium basis for companies that cannot afford their own cluster, for researchers and scientists, etc. They also organize weekly hacking sessions to which anyone can attend and get access to the cluster for free. Our research lab has been partnering with DAPLAB since the beginning and DAPLAB infrastructure is a very nice complement to our lab's hadoop cluster. (No more advertising in this post I promise.)

We now have all the data on HDFS, let's process it. Keep in mind this blog post is about the XML syndication feeds and not how and what we did with the other things we extracted (anchors, RDF triples, etc).

[vfelder@daplab ~]$ hdfs dfs -ls /data/WDC_112015/data/feeds | head -n2
Found 35669 items
1.7M 2016-02-28 15:21 /data/WDC_112015/data/feeds/ex_common-crawl_crawl-data_CC-MAIN-2015-48_segments_1448398444047.40_warc_CC-MAIN-20151124205404-00000-ip-10-71-132-137.ec2.internal.warc.gz.csv.gz

1
2
3

We have 35,669 gzipped TSV files. We don't want to work with this format and compression because gzip is slow and CSV/TSV is not ideal to query the data. Also, 35k files is a bit too much for a ~10 nodes cluster. We will convert these 35k gzipped TSV to 1,000 snappy-compressed parquet files.

Why converting 35k gunzip csv files to 1,000 snappy parquet files?

Parquet (opens new window) is better suited for querying (eg. using Hive) than CSV:
- Parquet files contain their schema, CSV don't.
- Parquet files store data by column, CSV is row-based.
- Columns of a parquet file are compressed (each column being compressed according to its data type).
Snappy also offers great advantages, here compared to gzip:
- Snappy compression is orders of magnitude faster than gzip.
- Snappy happily trades compression against read/write speed. After all, when running a job that loads terabytes of data from HDFS on a cluster, we care more about fast read/write than about sparing a few TB.
Less files means more throughput (as long as we have more files than processing cores of course). If we assume decompressing a file takes 100ms, the decompression alone for all 35k files will take a cumulated one hour. Of course the whole process will be distributed and run in parallel but it still constitutes overhead compared to working on only 1,000 files.
We could have chosen Avro, but our schemas being very basic and our need being to query columns, Parquet made more sense.

Remember our TSV "schema" is page_url | link_tag | link_type. It's a good thing we saved the whole <link tag because I realized after the EC2 run it might be interesting to extract a few other possible attributes. I wrote a short Scala script for Spark to extract these things and go from csv.gz to 1,000 .snappy.parquet.

feedsTransform.scalapackage info.exascale.wdctools

import com.netaporter.uri.Uri
import org.apache.spark.sql.types.{StringType, StructField, StructType}
import org.apache.spark.{SparkConf, SparkContext, sql}
import org.apache.spark.sql.functions._
import scala.language.postfixOps

object feedsTransform {
  def main(args: Array[String]) {
    val conf = new SparkConf()
      .setAppName("TransformFeeds")
      .set("spark.sql.parquet.compression.codec", "snappy") // snappy compression for parquet

    val sc = new SparkContext(conf)
    val sqlContext = new sql.SQLContext(sc)

    // schema for the CSV we'll load
    val feedSchema = StructType(Array(
      StructField("page", StringType, true),
      StructField("tag", StringType, true),
      StructField("type", StringType, true)))

    // read the CSV with our schema using databricks' spark-csv
    val df = sqlContext
      .read
      .format("com.databricks.spark.csv")
      .option("header", "false")
      .schema(feedSchema)
      .load("/data/WDC_112015/data/feeds/*.csv.gz")


    // extract from each page url their hostname
    // some URLs (eg. http://example.com:/foo.html) made Uri crash, I had to implement a regex
    // based check
    val hostnamePattern = "((\\/\\/|https\\:\\/\\/|http\\:\\/\\/)([^\\/\\:]+))"r

    val getHost: (String => String) = (page: String) => {
      val preFiltered = hostnamePattern findFirstIn page
      if (preFiltered.isEmpty) {
        println(s"prefiltering failed: $page")
        ""
      } else {
        val preFilteredString = preFiltered.get
        try {
          val host = Uri.parse(preFilteredString).host
          if (host.isEmpty) {
            println(s"parsing failed: $page prefiltered as: $preFilteredString")
            ""
          } else {
            host.get
          }
        } catch {
          case e: Throwable => {
            val exception = e.toString
            println(s"caught $exception")
            ""
          }
        }
      }
    }

    // extract rel="…"
    val relPattern = ".*rel=[\"']?([^'\"]*)[\"']?.*".r
    val getRel: (String => String) = (tag: String) => {
      tag match {
        // pattern-matching regex matches is so nice
        case relPattern(captured) => captured
        case _ => ""
      }
    }

    // extract href="…"
    val hrefPattern = ".*href=[\"']?([^'\"]*)[\"']?.*".r
    val getHref: (String => String) = (tag: String) => {
      tag match {
        case hrefPattern(captured) => captured
        case _ => ""
      }
    }

    // extract title="…"
    val titlePattern = ".*title=[\"']?([^'\"]*)[\"']?.*".r
    val getTitle: (String => String) = (tag: String) => {
      tag match {
        case titlePattern(captured) => captured
        case _ => ""
      }
    }

    // transform to lowercase
    val getLCType: (String => String) = (str: String) => {
      str.toLowerCase
    }

    // create spark sql user-defined functions for each of these scala function
    val sqlGetHost = udf(getHost)
    val sqlGetRel = udf(getRel)
    val sqlGetHref = udf(getHref)
    val sqlGetTitle = udf(getTitle)
    val sqlGetLCType = udf(getLCType)

    df
      // add hostname column based on page column
      .withColumn("hostname", sqlGetHost(col("page")))
      // add rel column based on tag column
      .withColumn("rel", sqlGetRel(col("tag")))
      // add href column based on tag column
      .withColumn("href", sqlGetHref(col("tag")))
      // add title column based on tag column
      .withColumn("title", sqlGetTitle(col("tag")))
      // replace type column with its lowercase version
      .withColumn("type", sqlGetLCType(col("type")))
      // only output 1,000 parquet files from the 35k csv.gz input files
      .coalesce(1000)
      // write as parquet to my hdfs home folder
      .write.parquet("/user/vfelder/feeds/feedsparsed.parquet/")
  }
}

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119

These 60G total / 35k .csv.gz, once converted to 1k .snappy.parquet with 4 additional columns, now take 187G. Not bad.

We could now create a Hive table like this:

CREATE EXTERNAL TABLE xi_wdc.feeds (page STRING, type STRING, tag STRING, hostname STRING, rel STRING, href STRING, title STRING) STORED AS PARQUET LOCATION '/user/vfelder/feeds/feedsparsed.parquet/';

EXTERNAL means the data for this table isn't moved by Hive to the tables location, it just stays where it is and Hive loads the data directly from these files. Should we drop this table the data won't be affected.

I also performed the above csv.gz to snappy.parquet conversion + hostname extraction for /data/WDC_112015/data/urls/ which contains all crawled URLs. The Scala program to do this is very similar to the one listed here and can be found here (opens new window).

Hive is nice to run some queries, but Spark SQL is equally nice and generally offers better performances. It also allows me to run queries directly from the ipython notebook I'm writing this blog post from, get the queries results and rework them directly in python.

The first part of this blog post was writting in Markdown in my ipython notebook running pyspark, the rest of this post is playing with the data directly from ipython, writing pyspark.sql queries and executing them directly from ipython, running them with pyspark over the cluster.

I got this handy *sh alias I'm using to spawn a notebook in a screen:

alias ipyspark='IPYTHON_OPTS="notebook --no-browser --ip=localhost --port=1339" pyspark --master yarn-master --conf spark.ui.port=$(shuf -i 2000-65000 -n 1) --num-executors 20 --executor-cores 2 --driver-memory 16g --executor-memory 16g'

Let's work with our extracted feeds and all crawled URLs

First, import some spark-sql libs.

from pyspark.sql import SQLContext
from pyspark.sql.functions import *

def h(n):
    """helper function to display numbers in a human-readable way"""
    return '{:,}'.format(n)

1
2
3
4
5
6

sc is spark context, it's already there because we're running ipython on pyspark.

sqlContext = SQLContext(sc)

We stored the feeds and urls as snappy compressed parquet files. All we have to do is read them as parquet, everything else is taken care of.

feeds = sqlContext.read.parquet('/user/vfelder/feeds/feedsparsed.parquet/')
urls = sqlContext.read.parquet('/user/vfelder/urls/urlsparsed.parquet/')

1
2

Common Crawl said their dump had 1.82B URLs, let's check if that's also the number of web pages we used for our extraction.

{'total number of urls': h(urls.count())}

{'total number of urls': '1,823,130,936'}

Nice, it seems the data matches.

Feeds and URLs files include these two columns : page and hostname, respectively the original URL of the web page and the hostname of this URL. A single web page can provide several feeds but we're only interested by the number of websites which provide at least one feed, so we take the distinct hostnames. Same with URLs: we crawled a lot of URLs but we are only interested by the number of distinct hostnames, to compare the two numbers.

distinct_feeds_count = feeds.select('hostname').distinct().count()
distinct_urls_count = urls.select('hostname').distinct().count()
{
    'hostnames with at least one feed': h(distinct_feeds_count),
    'hostnames processed': h(distinct_urls_count)
}

1
2
3
4
5
6

{'hostnames processed': '25,243,438',
 'hostnames with at least one feed': '10,294,833'}

'{:0.2f}%'.format(float(distinct_feeds_count)/distinct_urls_count*100)

'40.78%'

Ok, around 40% of the crawled websites provide at least one XML feed.

But on average, how many feeds per webpage?

total_feeds_count = feeds.count()
total_urls_count = urls.count()
float(total_feeds_count)/total_urls_count

1
2
3

1.942656384170972

Now it would be interesting to see which standards these feeds implement.

from pyspark.sql import Column
types_grouped = feeds
    .select('type')
    .where(col('type') != ' ')
    .groupBy('type')
    .count()
    .orderBy(desc('count'))
types_grouped.show()

1
2
3
4
5
6
7
8

+--------------------+---------+
|                type|    count|
+--------------------+---------+
| application/rss+xml|877891180|
|application/atom+xml|215504961|
|     application/xml| 20276729|
|            text/xml| 17086738|
| application/rdf+xml| 14669025|
|    application/atom|   545717|
|     application/rdf|   425259|
|     application/rss|   108139|
|            text/rdf|     9637|
|        text/rss+xml|      245|
|            text/rss|      125|
|        text/rdf+xml|       14|
+--------------------+---------+

Let's do some basic stats:

rss_feeds = types_grouped
    .filter('type LIKE "%rss%"')
    .agg({'count': 'sum'})
    .collect()
rss_total = rss_feeds[0].asDict().values()[0]

atom_feeds = types_grouped
    .filter('type LIKE "%atom%"')
    .agg({'count': 'sum'})
    .collect()
atom_total = atom_feeds[0].asDict().values()[0]

feeds_with_type = types_grouped
    .agg({'count': 'sum'})
    .collect()
feeds_with_type_total = feeds_with_type[0].asDict().values()[0]

{
    'RSS': h(rss_total),
    'Atom': h(atom_total)
}

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21

{'Atom': '216,050,678', 'RSS': '877,999,689'}

rss_pc = float(rss_total)/feeds_with_type_total*100
atom_pc = float(atom_total)/feeds_with_type_total*100

{
    '% RSS': '{:0.2f}%'.format(rss_pc),
    '% Atom': '{:0.2f}%'.format(atom_pc),
}

1
2
3
4
5
6
7

{'% Atom': '18.84%', '% RSS': '76.58%'}

alternate_total = feeds
    .select('rel', 'type')
    .where('rel LIKE "%alternate%" AND (type LIKE "%rss%" OR type LIKE "%atom%")')
    .count()
'{:0.2f}%'.format(float(alternate_total)/(rss_total+atom_total)*100)

1
2
3
4
5

'97.09%'

That's about it, my questions have been answered!

Let's quickly recap:

Common Crawl is an excellent free dataset.
WDC Framework is a cool tool but requires some tweaking.
It is possible to extract a fair amount of data from this 151TB dump for less than 500USD of EC2.
Out of 1.82B URLs we got 25.2M different hosts.
Processing this data is really fast on DAPLAB's cluster with my ipython+pyspark setup. This last (alternate_total) query takes <10s.
40.8% of these hostnames provide a web syndication feed.
RSS is the most popular syndication format by far with 76.6%, Atom is at 18.8%.
97.09% of the <link tags pointing to an RSS or Atom feed also specify rel="alternate", as the HTML standard recommends.

#rss #spark #ipython #scala