From Videos to Channels: Archiving Video Content on the Web - Thomas Drugeon, INA

From Videos to Channels: Archiving Video Content on the Web - Thomas Drugeon, INA

IIPC 2019 From videos to channels: archiving video content on the web - Thomas Drugeon, INA @inadlweb Video capture evolution @INA radio france arte radio 410 000 hours of video collected in 2018, the equivalent of 47 TV channels collected 24/7 twitter arte tv france tv canal+ vine facebook soundcloud arte concert wat/tf1 vimeo Ina.fr taratata curlturebox youtube dailymotion @inadlweb Video capture evolution @INA radio france arte radio 410 000 hours of video collected in 2018, the andequivalent as of of June 47 TV channels 2019... collected 24/7 twitter ● 22 million videos arte tv ● 2.3 million hours france tv ● 1.1 PB of data canal+ ● 1 TB per day on average (2018)vine ● 3.8 million unique channels facebook ● 7 500 channels comprehensively collected soundcloud arte concert ● 17 providers,wat/tf1 13 still active vimeo Ina.fr taratata curlturebox youtube dailymotion @inadlweb Video URN ● Uniquely identifying videos and channels for sourcing, crawling and access purposes ● Part of our global dlweb URN scheme ● Videos ○ web:av:youtube:jNQXAC9IVRw ○ web:av:twitter:560070131976392705 ● Channels / Users / Accounts / Authors ○ web:user:facebook:france2 ○ web:user:youtube:UCPlxmWbGbwVM2baH8q7MK6w @inadlweb Video sourcing ● Direct listing of publication for a given channel ○ From 7500 manually sourced channels (TV/radio related or native) ○ Specific to each provider: APIs, feeds, scraping, ... ● Continuous archive recrawl ○ DAFF-tool detecting video ids in feeds and HTML pages ○ Also searching for video embedded in tweet objects → URN list sent to video capture system @inadlweb Video capture ● In-house dedicated crawling system with specific code for each platform (DevOps) ● Each unique video URN is crawled once ● Around 1TB per day on avg (2018) ● Data selection (format, resolution, …) ● Metadata normalization ● Archive with the LAP (Live Archiving Proxy) ● DAFF storage (1.1PB) @inadlweb UGC Monitor: download process collect video title, description, video qualities video availability metadata = user id + + video urls representative select quality = quality download collect file(s) decryption streams = + merge Metadata normalize repackage = segments/tracks + + extraction (ffprobe) archive = through LAP @inadlweb Access ● URN identification ● Dedicated streaming engine ○ Same webservice and player for all INA medias (TV/radio have different URN namespaces) ○ On-the-fly conversion (mp4, flv, webm, ogg, …) ○ On-the-fly extraction (image, preview, waveform, …) ● Integration ○ Web archive browsing (video id detection) ○ Twitter & Facebook search ○ Dedicated video search engine (metadata) ● <dlweb-preview-video urn="..."> web component @inadlweb Access Demo https://youtu.be/qlOY8dWSiMA @inadlweb Video preview web component @inadlweb Issues and future considerations ● Obsolete video codec (eg vivo) calling for batch conversions ● Best-effort and platforms goodwill ● DRM even for free contents: rationalisation ● youtube-dl @inadlweb.

View Full Text

Details

  • File Type
    pdf
  • Upload Time
    -
  • Content Languages
    English
  • Upload User
    Anonymous/Not logged-in
  • File Pages
    11 Page
  • File Size
    -

Download

Channel Download Status
Express Download Enable

Copyright

We respect the copyrights and intellectual property rights of all users. All uploaded documents are either original works of the uploader or authorized works of the rightful owners.

  • Not to be reproduced or distributed without explicit permission.
  • Not used for commercial purposes outside of approved use cases.
  • Not used to infringe on the rights of the original creators.
  • If you believe any content infringes your copyright, please contact us immediately.

Support

For help with questions, suggestions, or problems, please contact us