03 April 2017

Spark 2 and Data science at scale

Matt Brandwein @Cloudera

Spark without MapReduce or HDFS

Dataset API (RDD + DataFrames)

Structured Streaming

From PoC to production, all on Spark, with packaged dependencies:

  1. Reporting and Dashboarding
  2. Batch pipeline/scoring (ETL)
  3. On-line serving app

scikit-learn doesn’t work with Spark. Resolving the conflict between data scientists and IT calls for a familiar environment:
Cloudera Data Science Workbench

All installed packages are persistent and ready for sharing. Cloudera architect


