
Configuring Memory for Apache Spark on Big Data Cloud


The default memory settings in Big Data Cloud are rather conservative. If you try to load a table of any size, it will fail. It will also fail if you try to run a streaming job. This is not for lack of power; even the smallest single-node cluster has plenty. To make use of that compute power, all that is needed are a few tweaks to the Spark configuration in Ambari.

Credit for these discoveries goes to my colleague David Bayard, who is quite possibly a wizard.

TL;DR

You’ve got to adjust the Spark configuration and the queueing as necessary for your use case.

Config

  1. Custom spark2-thrift-sparkconf, spark.sql.shuffle.partitions=4
  2. Advanced spark2-env, spark_daemon_memory=4096MB
  3. Advanced spark2-env, SPARK_EXECUTOR_MEMORY="2G"
```plain
# If you are getting Kryo buffer errors, try these:
spark.kryoserializer.buffer=512m
spark.kryoserializer.buffer.max=2047m
SPARK_EXECUTOR_MEMORY="4G"

# If you are getting errors about the driver not being able to collect all the executors' data, try this:
spark.sql.thriftServer.incrementalCollect=true
```
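
Most of these settings live in the Thrift Server's config, but the same knobs can also be passed straight to an ad-hoc job if that's where you're hitting the memory wall. Here's a minimal sketch, assuming shell access to a cluster node with the spark2 client installed; my_job.py is a hypothetical script standing in for your own:

```plain
# Hypothetical: passing the same memory/shuffle settings to a one-off job
spark-submit \
  --executor-memory 2G \
  --conf spark.sql.shuffle.partitions=4 \
  --conf spark.kryoserializer.buffer=512m \
  --conf spark.kryoserializer.buffer.max=2047m \
  my_job.py
```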

Queueing

See the docs: https://docs.oracle.com/en/cloud/paas/big-data-compute-cloud/csspc/managing-work-queue-capacity.html

For running large jobs on tiny clusters, I always power up the default queue. If you want to work with notebooks or push down aggregations from OAC to Spark, then you’ll need to increase the capacity of the interactive or api queue, respectively.
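
Queue capacity itself is managed from the Big Data Cloud console (see the docs linked above), but it can be handy to sanity-check the current allocations from a cluster node first. A quick sketch, assuming shell access to a node with the YARN client; the queue names are the ones mentioned above:

```plain
# Show configured and used capacity for each work queue
yarn queue -status default
yarn queue -status interactive
yarn queue -status api
```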


Long steps:

  1. Connect to Ambari
  2. Go to Spark2 on the left-hand list of services
  3. Then click on the Configs tab
  4. Navigate down to the Custom spark2-thrift-sparkconf section
  5. In the Custom spark2-thrift-sparkconf section, click the "Add Property..." link, add the property spark.sql.shuffle.partitions=4, and click Add.
  6. Expand the Advanced spark2-env section and change spark_daemon_memory to 4096 MB.
  7. Also in the Advanced spark2-env section, in the "content" field, edit and uncomment the line for SPARK_EXECUTOR_MEMORY. When finished, it should read:
    SPARK_EXECUTOR_MEMORY="2G"
  8. Click Save at the top of the screen.
  9. In the notes field, enter "memory"
  10. Click Save again
  11. If you see a "Configurations" pop-up, click "Proceed Anyway"
  12. Click OK to acknowledge that changes were made successfully
  13. Then click Restart, then Restart All Affected
  14. Then click Confirm Restart All
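
Once everything restarts, it's worth confirming that the Thrift Server actually picked up the new values. One hedged way to check is to connect with beeline and ask Spark SQL to echo the setting back; the host and port below are assumptions, so substitute your cluster's Spark2 Thrift Server endpoint:

```plain
# Hypothetical check: ask the running Thrift Server for the effective value
beeline -u jdbc:hive2://localhost:10016 -e "SET spark.sql.shuffle.partitions;"
```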