Processing big data in real time is challenging because of the demands it places on scalability, data consistency, and fault tolerance. This course shows you how to use Spark to make your overall analysis workflow faster and more efficient. You'll learn the core concepts and tools of the Spark ecosystem, including Spark Streaming and its API, the machine learning extension (MLlib), and Structured Streaming.
- Write your own Python programs that can interact with Spark
- Implement data stream consumption using Apache Spark
- Recognize common operations in Spark to process known data streams
- Integrate Spark Streaming with Amazon Web Services
- Create a collaborative filtering model with Python and the MovieLens dataset
- Apply processed data streams to Spark machine learning APIs
For an optimal student experience, we recommend the following hardware configuration:
- Processor: Intel Core i5 or equivalent
- Memory: 4 GB RAM
- Hard disk: 40 GB available space
- An Internet connection
You’ll also need the following software installed in advance:
- Operating System: Windows 7 SP1 64-bit, Windows 8.1 64-bit, or Windows 10 64-bit
- Browser: Google Chrome (latest version)
- PostgreSQL 9.0 or above
- Spark 2.3.0
- Amazon Web Services (AWS) account