Librería Portfolio

HIGH PERFORMANCE SPARK. BEST PRACTICES FOR SCALING AND OPTIMIZING APACHE SPARK
Author: Karau, H.
Publisher: O'Reilly
Year: 2017
Subject: SQL
ISBN: 978-1-4919-4320-5
Pages: 358
Price: 38,50 €

Synopsis

Apache Spark is amazing when everything clicks. But if you haven't seen the performance improvements you expected, or still don't feel confident enough to use Spark in production, this practical book is for you. Authors Holden Karau and Rachel Warren demonstrate performance optimizations to help your Spark queries run faster and handle larger data sizes, while using fewer resources.

Ideal for software engineers, data engineers, developers, and system administrators working with large-scale data applications, this book describes techniques that can reduce data infrastructure costs and developer hours. Not only will you gain a more comprehensive understanding of Spark, you'll also learn how to make it sing.

With this book, you'll explore:

How Spark SQL's new interfaces improve performance over Spark's RDD data structure
The choice between data joins in Core Spark and Spark SQL
Techniques for getting the most out of standard RDD transformations
How to work around performance issues in Spark's key/value pair paradigm
Writing high-performance Spark code without Scala or the JVM
How to test for functionality and performance when applying suggested improvements
Using Spark MLlib and Spark ML machine learning libraries
Spark's Streaming components and external community packages
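As a taste of the key/value pair topics listed above: a core theme of the book is that per-key aggregation should combine values before the shuffle (as `reduceByKey` does) rather than shipping every record across the network (as `groupByKey` does). This is a minimal, Spark-free sketch of that idea using plain Python collections; the two-partition setup and record counts are illustrative assumptions, not Spark code.

```python
from collections import defaultdict

records = [("a", 1), ("b", 2), ("a", 3), ("b", 4), ("a", 5)]
partitions = [records[:3], records[3:]]  # pretend each list lives on its own node

# groupByKey-style: every (key, value) pair crosses the shuffle boundary.
shuffled_naive = [kv for part in partitions for kv in part]

# reduceByKey-style: combine within each partition first, then merge the
# much smaller per-partition sums.
def combine(part):
    acc = defaultdict(int)
    for k, v in part:
        acc[k] += v
    return dict(acc)

pre_combined = [combine(part) for part in partitions]
shuffled_smart = [kv for d in pre_combined for kv in d.items()]

totals = defaultdict(int)
for k, v in shuffled_smart:
    totals[k] += v

print(len(shuffled_naive))   # 5 records would be shuffled
print(len(shuffled_smart))   # 4 records shuffled (2 keys x 2 partitions)
print(dict(totals))          # {'a': 9, 'b': 6}
```

With many records per key, the pre-combined version shuffles at most (keys × partitions) entries instead of one entry per record, which is the intuition behind the book's warnings about `groupByKey`.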



Chapter 1. Introduction to High Performance Spark
What Is Spark and Why Performance Matters
What You Can Expect to Get from This Book
Spark Versions
Why Scala?
Conclusion
Chapter 2. How Spark Works
How Spark Fits into the Big Data Ecosystem
Spark Model of Parallel Computing: RDDs
Spark Job Scheduling
The Anatomy of a Spark Job
Conclusion
Chapter 3. DataFrames, Datasets, and Spark SQL
Getting Started with the SparkSession (or HiveContext or SQLContext)
Spark SQL Dependencies
Basics of Schemas
DataFrame API
Data Representation in DataFrames and Datasets
Data Loading and Saving Functions
Datasets
Extending with User-Defined Functions and Aggregate Functions (UDFs, UDAFs)
Query Optimizer
Debugging Spark SQL Queries
JDBC/ODBC Server
Conclusion
Chapter 4. Joins (SQL and Core)
Core Spark Joins
Spark SQL Joins
Conclusion
Chapter 5. Effective Transformations
Narrow Versus Wide Transformations
What Type of RDD Does Your Transformation Return?
Minimizing Object Creation
Iterator-to-Iterator Transformations with mapPartitions
Set Operations
Reducing Setup Overhead
Reusing RDDs
Conclusion
Chapter 6. Working with Key/Value Data
The Goldilocks Example
Actions on Key/Value Pairs
What's So Dangerous About the groupByKey Function
Choosing an Aggregation Operation
Multiple RDD Operations
Partitioners and Key/Value Data
Dictionary of OrderedRDDOperations
Secondary Sort and repartitionAndSortWithinPartitions
Straggler Detection and Unbalanced Data
Conclusion
Chapter 7. Going Beyond Scala
Beyond Scala within the JVM
Beyond Scala, and Beyond the JVM
Calling Other Languages from Spark
The Future
Conclusion
Chapter 8. Testing and Validation
Unit Testing
Getting Test Data
Property Checking with ScalaCheck
Integration Testing
Verifying Performance
Job Validation
Conclusion
Chapter 9. Spark MLlib and ML
Choosing Between Spark MLlib and Spark ML
Working with MLlib
Working with Spark ML
General Serving Considerations
Conclusion
Chapter 10. Spark Components and Packages
Stream Processing with Spark
GraphX
Using Community Packages and Libraries
Conclusion
Appendix. Tuning, Debugging, and Other Things Developers Like to Pretend Don't Exist
Spark Tuning and Cluster Sizing
Basic Spark Core Settings: How Many Resources to Allocate to the Spark Application?
Serialization Options
Some Additional Debugging Techniques