Spark 2 and Data Science at Scale
Matt Brandwein @Cloudera
Spark without MapReduce or HDFS
Dataset API (RDD + DataFrames)
Structured Streaming
From PoC to prod, all using Spark, dependency packages
- Reporting and Dashboarding
- Batch pipeline/scoring (ETL)
- Online serving app
scikit-learn doesn’t work with Spark
Solve the conflict between data scientists and IT => familiar environment
Cloudera Data Science Workbench
All installed packages are persistent and ready for sharing. Cloudera architect