To measure the social impact each story might have had, we use the Bit.ly API to get each story's total click count: we reverse-match the story's URL to its Bit.ly shortened link(s) and then add up the click counts of those shortened links.
After collecting and extracting a story, we postpone the click count collection: we fetch the story's total click count twice, 3 days and 30 days after its publication (or collection) date. Fetching the click count 3 days after publication gives us some link impact data sooner rather than later; at 30 days we refetch the count to get the "full" click count, since we found that most Bit.ly link clicks happen within 30 days of a story's publication.
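To make the lookup-and-sum step concrete, here is a minimal Perl sketch. It assumes the Bit.ly v3 `/v3/link/lookup` and `/v3/link/clicks` endpoints and their response fields (both since retired), and a `BITLY_ACCESS_TOKEN` environment variable; none of this is Media Cloud's actual helper code.

```perl
#!/usr/bin/env perl

use strict;
use warnings;

use LWP::UserAgent;
use JSON::PP qw(decode_json);
use URI::Escape qw(uri_escape);

my $ACCESS_TOKEN = $ENV{ BITLY_ACCESS_TOKEN };    # placeholder configuration
my $API          = 'https://api-ssl.bitly.com';

my $ua = LWP::UserAgent->new();

# Tiny GET + JSON-decode helper
sub bitly_get
{
    my ( $path, %params ) = @_;

    my $query = join( '&', map { $_ . '=' . uri_escape( $params{ $_ } ) } sort keys %params );
    my $response = $ua->get( "$API$path?access_token=$ACCESS_TOKEN&$query" );
    die( $response->status_line ) unless $response->is_success;

    return decode_json( $response->decoded_content );
}

sub total_clicks_for_story_url
{
    my ( $story_url ) = @_;

    # Reverse-match the story URL to its Bit.ly shortened link(s)
    my $lookup = bitly_get( '/v3/link/lookup', url => $story_url );
    my @links  = grep { defined $_ }
        map { $_->{ aggregate_link } } @{ $lookup->{ data }->{ link_lookup } || [] };

    # Add up the click counts of every matching shortened link
    my $total_clicks = 0;
    for my $link ( @links )
    {
        my $clicks = bitly_get( '/v3/link/clicks', link => $link, rollup => 'true' );
        $total_clicks += $clicks->{ data }->{ link_clicks } || 0;
    }

    return $total_clicks;
}

print total_clicks_for_story_url( 'http://www.example.com/story.html' ) . "\n";
```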
- After extracting a newly collected story, `::DBI::Stories::process_extracted_story()` calls `::Util::Bitly::Schedule::add_to_processing_schedule()`, which adds the story to Bit.ly's processing schedule.
- `add_to_processing_schedule()` adds two rows for each story to the `bitly_processing_schedule` table so that the story's total click count gets (re)fetched 3 and 30 days after its publication or collection date (a sketch of this scheduling logic follows the list).
- `mediawords_process_bitly_schedule.pl` runs periodically and adds due stories from the `bitly_processing_schedule` table to the `::Job::Bitly::FetchStoryStats` job queue for the click count to be (re)fetched (see the schedule sweep sketch below).
- The `::Job::Bitly::FetchStoryStats` worker fetches a list of days and clicks from the Bit.ly API, stores the raw JSON response on Amazon S3, and adds a `::Job::Bitly::AggregateStoryStats` job which will in turn process the raw data just fetched, i.e. add the daily click counts up into a single total click count (see the fetch worker sketch below).
- `::Job::Bitly::AggregateStoryStats` fetches the raw JSON response from S3, adds the daily click counts together to get the story's total click count, stores the total in the `bitly_clicks_total` table, and lastly adds the story to the Solr processing queue so that the story gets reimported into Solr with the total click count present (see the aggregation sketch below).
- By default, Bit.ly statistics collection is postponed 3 and 30 days from the story's publication date (`stories.publish_date`). However, if that date doesn't look valid (it is before 2008, when the Media Cloud project started, or it is in the future), the Bit.ly scheduler falls back to the story's collection date (`stories.collect_date`): the crawler is assumed to collect a story shortly after it appears online, so the collection date is usually close to the story's actual publication date. The scheduling sketch below includes this fallback.
- For stories more than 30 days old, we just immediately queue a single job to collect the data.
- If the `skip_bitly_processing` field on a feed is true, Bit.ly data is not added for stories in that feed (we use this field to prevent the system from trying to collect data on imported stories).
- A simple LRU disk cache sits in front of the Bit.ly JSON storage, so quick repeated fetches of the JSON during the processing steps do not require repeated fetches from the content store (which is Amazon S3 in production); a toy read-through cache sketch closes this section.
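The scheduling behaviour described above (publish date validation, the collect date fallback, the two delayed fetches, and the immediate single fetch for stories that are already old), condensed into one hedged Perl sketch. The `bitly_processing_schedule` columns (`stories_id`, `fetch_at`), the `enqueue_fetch_story_stats()` stub and the date handling details are illustrative assumptions, not the actual Media Cloud implementation.

```perl
use strict;
use warnings;

use DBI;
use Time::Piece;
use Time::Seconds qw(ONE_DAY);

my $dbh = DBI->connect( 'dbi:Pg:dbname=mediacloud', '', '', { RaiseError => 1 } );

# Placeholder for adding a ::Job::Bitly::FetchStoryStats job
sub enqueue_fetch_story_stats
{
    my ( $stories_id ) = @_;
    warn( "Would enqueue FetchStoryStats job for story $stories_id\n" );
}

# Use publish_date unless it is before 2008 or in the future, in which case
# fall back to collect_date
sub story_schedule_epoch
{
    my ( $story ) = @_;

    my $publish = Time::Piece->strptime( $story->{ publish_date }, '%Y-%m-%d %H:%M:%S' );
    my $lower   = Time::Piece->strptime( '2008-01-01', '%Y-%m-%d' );

    return $publish->epoch if $publish->epoch >= $lower->epoch and $publish->epoch <= time();

    return Time::Piece->strptime( $story->{ collect_date }, '%Y-%m-%d %H:%M:%S' )->epoch;
}

sub add_to_processing_schedule
{
    my ( $story ) = @_;

    my $base_epoch = story_schedule_epoch( $story );

    # Already more than 30 days old: fetch the click count once, right away
    if ( time() - $base_epoch > 30 * ONE_DAY )
    {
        enqueue_fetch_story_stats( $story->{ stories_id } );
        return;
    }

    # Otherwise add the two schedule rows, 3 and 30 days out
    for my $delay_days ( 3, 30 )
    {
        $dbh->do(
            'INSERT INTO bitly_processing_schedule (stories_id, fetch_at) VALUES (?, TO_TIMESTAMP(?))',
            undef, $story->{ stories_id }, $base_epoch + $delay_days * ONE_DAY
        );
    }
}
```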
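The periodic schedule sweep might then look roughly like the following, reusing `$dbh` and the stub from the previous sketch. `DELETE ... RETURNING` is an assumption; the real script may well mark rows instead of deleting them.

```perl
# Move every due row from the schedule table into the fetch job queue
my $sth = $dbh->prepare( <<'SQL' );
    DELETE FROM bitly_processing_schedule
    WHERE fetch_at <= NOW()
    RETURNING stories_id
SQL
$sth->execute();

while ( my ( $stories_id ) = $sth->fetchrow_array() )
{
    enqueue_fetch_story_stats( $stories_id );
}
```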
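The fetch worker, reduced to its three moves. The S3 key layout and all helper names are hypothetical; `fetch_bitly_click_stats_json()` stands in for a per-day variant of the click count fetching shown in the first sketch.

```perl
use strict;
use warnings;

use JSON::PP qw(encode_json);

# Placeholders for the content store and the next job in the chain
sub store_json_on_s3 { my ( $key, $json ) = @_; warn( "Would store $key on S3\n" ); }
sub enqueue_aggregate_story_stats { my ( $stories_id ) = @_; warn( "Would enqueue AggregateStoryStats for story $stories_id\n" ); }

# Hypothetical stand-in: per-link, per-day click counts from the Bit.ly API
sub fetch_bitly_click_stats_json
{
    my ( $story_url ) = @_;
    return encode_json( { 'http://bit.ly/example' => [ { day => '2014-01-01', clicks => 3 } ] } );
}

# Sketch of what the ::Job::Bitly::FetchStoryStats worker does
sub fetch_story_stats
{
    my ( $stories_id, $story_url ) = @_;

    # 1. Fetch the list of days and clicks from the Bit.ly API
    my $raw_json = fetch_bitly_click_stats_json( $story_url );

    # 2. Keep the raw JSON response around (on Amazon S3) for reprocessing
    store_json_on_s3( "bitly/$stories_id.json", $raw_json );

    # 3. Chain the aggregation job
    enqueue_aggregate_story_stats( $stories_id );
}
```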
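And the matching aggregation step: add every day's clicks across every matched link into one total, store it in `bitly_clicks_total`, and queue the story for Solr reimport. The raw JSON layout, the PostgreSQL 9.5+ `ON CONFLICT` upsert and both stubs are assumptions.

```perl
use strict;
use warnings;

use JSON::PP qw(decode_json);

# Placeholders for the content store read and the Solr queue
sub fetch_json_from_s3 { my ( $key ) = @_; return '{"http://bit.ly/example":[{"day":"2014-01-01","clicks":3}]}'; }
sub add_to_solr_queue { my ( $stories_id ) = @_; warn( "Would queue story $stories_id for Solr reimport\n" ); }

# Sketch of what the ::Job::Bitly::AggregateStoryStats worker does
sub aggregate_story_stats
{
    my ( $dbh, $stories_id ) = @_;

    # Raw per-link, per-day stats as stored by the fetch worker
    my $raw = decode_json( fetch_json_from_s3( "bitly/$stories_id.json" ) );

    # Add the daily click counts together into a single total
    my $total_clicks = 0;
    for my $days ( values %{ $raw } )
    {
        $total_clicks += $_->{ clicks } for @{ $days };
    }

    # Store the total click count
    $dbh->do( <<'SQL', undef, $stories_id, $total_clicks );
INSERT INTO bitly_clicks_total (stories_id, click_count)
VALUES (?, ?)
ON CONFLICT (stories_id) DO UPDATE SET click_count = EXCLUDED.click_count
SQL

    # Reimport into Solr so the total click count becomes searchable
    add_to_solr_queue( $stories_id );
}
```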
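Finally, one way to put a simple LRU disk cache in front of the S3 JSON store: a read-through lookup keyed by S3 key, with the least recently used cache files pruned by access time once the directory grows past a limit. A toy illustration of the idea, not Media Cloud's actual cache.

```perl
use strict;
use warnings;

use Digest::MD5 qw(md5_hex);
use File::Path qw(make_path);
use File::Spec ();

my $CACHE_DIR       = '/var/tmp/bitly_json_cache';
my $MAX_CACHE_FILES = 10_000;

# Read-through cache: try the disk cache first, fall back to the fetcher
sub cached_fetch_json
{
    my ( $s3_key, $fetch_from_s3 ) = @_;

    make_path( $CACHE_DIR );
    my $cache_file = File::Spec->catfile( $CACHE_DIR, md5_hex( $s3_key ) . '.json' );

    if ( -e $cache_file )
    {
        utime( undef, undef, $cache_file );    # bump access time, i.e. "recently used"
        open( my $fh, '<', $cache_file ) or die( $! );
        local $/;    # slurp mode
        return scalar <$fh>;
    }

    my $json = $fetch_from_s3->( $s3_key );

    open( my $fh, '>', $cache_file ) or die( $! );
    print { $fh } $json;
    close( $fh );

    evict_lru_files();

    return $json;
}

# Delete the least recently used files once the cache grows too large
sub evict_lru_files
{
    my @files = glob( File::Spec->catfile( $CACHE_DIR, '*.json' ) );
    return if @files <= $MAX_CACHE_FILES;

    # Oldest access time first == least recently used first
    my @lru_first = sort { ( stat( $a ) )[ 8 ] <=> ( stat( $b ) )[ 8 ] } @files;
    unlink( splice( @lru_first, 0, @lru_first - $MAX_CACHE_FILES ) );
}

# Usage: my $json = cached_fetch_json( 'bitly/123.json', \&fetch_json_from_s3 );
```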