ECS 119: Data Processing Pipelines

Subject
ECS 119
Title
Data Processing Pipelines
Status
Active
Units
4.0
Effective Term
2024 Spring Quarter
Learning Activities
Lecture: 3 hours
Discussion: 1 hour
Description
Introduction to software systems for processing large datasets. Hands-on experience with scripting, data streams, distributed computing, and software development and deployment infrastructure. GE: SE.
Prerequisites
ECS 116 or ECS 165A
Enrollment Restrictions
Pass One restricted to Data Science majors; Pass Two restricted to undergraduate students. Not intended for Computer Science or Computer Science & Engineering majors.

Summary of Course Content:

Tools, processes, and ethical considerations associated with building systems that use large datasets.

  1. Basics of big data processing
    1. Sources of big data: databases, real-time data
    2. Pipelining in Unix
    3. Scripting in Python
  2. Software engineering tools
    1. Debugging and IDEs
    2. Version control, e.g., Git
    3. Virtualization, e.g., Docker
    4. Orchestration, e.g., Puppet, Salt
  3. Organizing teams for big data systems
    1. Deciding goals and requirements
    2. Ethics of big data systems
    3. Testing and quality assurance
  4. Programming for streaming data
    1. Data streams in Python
    2. Data streams in R
  5. Distributed processing
    1. Parallel algorithms and parallel thinking
    2. MapReduce/Hadoop processing model
    3. Distributed files and the Hadoop Distributed Filesystem
    4. Communication between nodes
    5. Distributed data structures and distributed streaming, e.g., Spark, YARN

Illustrative Reading:

Nandhini Abirami R, Seifedine Kadry, Amir H. Gandomi, and Balamurugan Balusamy, Big Data: Concepts, Technology, and Architecture, Wiley, 1st edition (April 13, 2021). ISBN-13: 978-1119701828.

Potential Course Overlap:

This course has minimal overlap with ECS 158; it focuses on software development at a higher level of abstraction for big data processing systems, and will not cover low-level primitives such as locks, shared-memory concurrency, and message passing. This course overlaps with ECS 161 in covering the basics of version control and build systems.

Final Exam:
Yes
 
