27-29 November, Vilnius
Conference about Big Data, High Load, Data Science, Machine Learning & AI
Julien Nioche runs DigitalPebble Ltd, a consultancy based in Bristol, UK, specialising in open source solutions for text engineering. He is a member of the Apache Software Foundation and a committer on Apache Nutch and various other projects. His expertise covers web crawling, natural language processing, machine learning and search.
Introduction to web crawling with StormCrawler (and Elasticsearch)
In this workshop, we will explore StormCrawler, a collection of resources for building low-latency, large-scale web crawlers on Apache Storm. After a short introduction to Apache Storm and an overview of what StormCrawler provides, we’ll put it to use for a simple crawl before moving on to the deployed mode of Storm. In the second part of the session, we will introduce metrics, index documents with Elasticsearch and Kibana, and dive into data extraction. Finally, we’ll cover recursive crawls and scalability. This course will be hands-on: attendees will run the code on their own machines.