Apache Spark™ - Unified Engine for large-scale data analytics
Install with pip:

  $ pip install pyspark
  $ pyspark

Or use the official Docker image:

  $ docker run -it --rm spark:python3 /opt/spark/bin/pyspark
spark.apache.org
1. What is Apache Spark?
Apache Spark is a multi-language engine for executing data engineering, data science, and machine learning on single-node machines or clusters.
2. Key features
1) Batch/Streaming data
2) SQL Analytics
3) Data Science at scale
4) Machine learning
3. Why Spark
The most widely used engine for scalable computing
Thousands of companies, including 80% of the Fortune 500, use Apache Spark.
Over 2000 contributors to the open source project from industry and academia.
Apache Spark integrates with your favorite frameworks, helping to scale them to thousands of machines.