Infrastructure Integration¶
Configuration¶
-
Configure the agent by editing /etc/nutanix/epoch-dd-agent/conf.d/spark.yaml on the collectors. Example:

```yaml
init_config:

instances:
  #
  # The Spark check can retrieve metrics from Standalone Spark, YARN and
  # Mesos. All methods require the `spark_url` to be configured.
  #
  # For Spark Standalone, `spark_url` must be set to the Spark master's web
  # UI. This is "http://localhost:8080" by default.
  #
  # For YARN, `spark_url` must be set to YARN's ResourceManager address. The
  # ResourceManager host name can be found in the yarn-site.xml conf file
  # under the property `yarn.resourcemanager.address`. The ResourceManager
  # port can be found in the yarn-site.xml conf file under the property
  # `yarn.resourcemanager.webapp.address`. This is "http://localhost:8088"
  # by default.
  #
  # For Mesos, `spark_url` must be set to the Mesos master's web UI. This is
  # "http://<master_ip>:5050" by default, where `<master_ip>` is the IP
  # address or resolvable host name for the Mesos master.
  #
  # The use of `resourcemanager_uri` has been deprecated, but is still
  # functional.
  - spark_url: http://localhost:8088

    # To enable monitoring of a Standalone Spark cluster, the spark cluster
    # mode must be set. Uncomment the cluster mode that applies.
    # spark_cluster_mode: spark_yarn_mode
    # spark_cluster_mode: spark_standalone_mode
    # spark_cluster_mode: spark_mesos_mode

    # To use an older (versions prior to 2.0) Standalone Spark cluster,
    # `spark_pre_20_mode` must be set.
    # spark_pre_20_mode: true

    # If you have enabled the Spark UI proxy, you may set this to `true`.
    # spark_proxy_enabled: false

    # A required, friendly name for the cluster.
    # cluster_name: MySparkCluster

    # Optional tags to be applied to every emitted metric.
    # tags:
    #   - key:value
    #   - instance:production
```
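Before restarting the agent, it can help to confirm that the endpoint you set as `spark_url` is reachable and actually lists Spark applications. Below is a minimal sketch in Python, assuming the default YARN setup from the example above; the `ws/v1/cluster/apps` path is the standard YARN ResourceManager REST API, but the host, port, and the presence of running apps are assumptions:

```python
import json
import urllib.request

# Should match the `spark_url` value in spark.yaml (assumed YARN default).
SPARK_URL = "http://localhost:8088"

# The ResourceManager REST API lists applications; the running Spark apps
# are the ones the check drills into for job/stage/executor metrics.
with urllib.request.urlopen(f"{SPARK_URL}/ws/v1/cluster/apps?states=RUNNING") as resp:
    payload = json.load(resp)

for app in (payload.get("apps") or {}).get("app", []):
    print(app["id"], app["name"], app["applicationType"])
```

If this prints nothing, either no Spark job is currently running or the check is pointed at the wrong address.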
-
Check that all YAML files are valid with the following command:
/etc/init.d/epoch-collectors configcheck
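If `configcheck` flags a file, loading it directly often pinpoints the problem faster. A short sketch using PyYAML (assuming it is installed; the `instances`/`spark_url` structure comes from the example above):

```python
import yaml

# Path from the configuration step above.
CONF = "/etc/nutanix/epoch-dd-agent/conf.d/spark.yaml"

try:
    with open(CONF) as f:
        config = yaml.safe_load(f)
    # A structurally valid Spark check config has a list of instances,
    # each carrying a spark_url.
    for instance in config.get("instances") or []:
        assert "spark_url" in instance, "each instance needs a spark_url"
    print("OK:", CONF)
except yaml.YAMLError as err:
    # PyYAML reports the line and column of the parse failure.
    print("Invalid YAML:", err)
```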
-
Restart the Agent using the following command:
/etc/init.d/epoch-collectors restart
-
Execute the info command to verify that the integration check has passed:
/etc/init.d/epoch-collectors info
The output of the command should contain a section similar to the following:
```
Checks
======

  [...]

  spark
  -----
      - instance #0 [OK]
      - Collected 8 metrics & 0 events
```
Infrastructure Datasources¶
Datasource | Available Aggregations | Unit | Description |
---|---|---|---|
spark.job.num_tasks | avg, max, min, sum | task/second | Number of tasks in the application |
spark.job.num_active_tasks | avg, max, min, sum | task/second | Number of active tasks in the application |
spark.job.num_skipped_tasks | avg, max, min, sum | task/second | Number of skipped tasks in the application |
spark.job.num_failed_tasks | avg, max, min, sum | task/second | Number of failed tasks in the application |
spark.job.num_active_stages | avg, max, min, sum | stage/second | Number of active stages in the application |
spark.job.num_completed_stages | avg, max, min, sum | stage/second | Number of completed stages in the application |
spark.job.num_skipped_stages | avg, max, min, sum | stage/second | Number of skipped stages in the application |
spark.job.num_failed_stages | avg, max, min, sum | stage/second | Number of failed stages in the application |
spark.stage.num_active_tasks | avg, max, min, sum | task/second | Number of active tasks in the application's stages |
spark.stage.num_complete_tasks | avg, max, min, sum | task/second | Number of complete tasks in the application's stages |
spark.stage.num_failed_tasks | avg, max, min, sum | task/second | Number of failed tasks in the application's stages |
spark.stage.executor_run_time | avg, max, min, sum | fraction | Fraction of time (ms/s) spent by the executor in the application's stages |
spark.stage.input_bytes | avg, max, min, sum | byte/second | Input bytes in the application's stages |
spark.stage.input_records | avg, max, min, sum | record/second | Input records in the application's stages |
spark.stage.output_bytes | avg, max, min, sum | byte/second | Output bytes in the application's stages |
spark.stage.output_records | avg, max, min, sum | record/second | Output records in the application's stages |
spark.stage.shuffle_read_bytes | avg, max, min, sum | byte/second | Number of bytes read during a shuffle in the application's stages |
spark.stage.shuffle_read_records | avg, max, min, sum | record/second | Number of records read during a shuffle in the application's stages |
spark.stage.shuffle_write_bytes | avg, max, min, sum | byte/second | Number of shuffled bytes in the application's stages |
spark.stage.shuffle_write_records | avg, max, min, sum | record/second | Number of shuffled records in the application's stages |
spark.stage.memory_bytes_spilled | avg, max, min, sum | byte/second | Number of bytes spilled to disk in the application's stages |
spark.stage.disk_bytes_spilled | avg, max, min, sum | byte/second | Max size on disk of the spilled bytes in the application's stages |
spark.driver.rdd_blocks | avg, max, min, sum | block/second | Number of RDD blocks in the driver |
spark.driver.memory_used | avg, max, min, sum | byte/second | Amount of memory used in the driver |
spark.driver.disk_used | avg, max, min, sum | byte/second | Amount of disk used in the driver |
spark.driver.active_tasks | avg, max, min, sum | task/second | Number of active tasks in the driver |
spark.driver.failed_tasks | avg, max, min, sum | task/second | Number of failed tasks in the driver |
spark.driver.completed_tasks | avg, max, min, sum | task/second | Number of completed tasks in the driver |
spark.driver.total_tasks | avg, max, min, sum | task/second | Total number of tasks in the driver |
spark.driver.total_duration | avg, max, min, sum | fraction | Fraction of time (ms/s) spent by the driver |
spark.driver.total_input_bytes | avg, max, min, sum | byte/second | Number of input bytes in the driver |
spark.driver.total_shuffle_read | avg, max, min, sum | byte/second | Number of bytes read during a shuffle in the driver |
spark.driver.total_shuffle_write | avg, max, min, sum | byte/second | Number of shuffled bytes in the driver |
spark.driver.max_memory | avg, max, min, sum | byte/second | Maximum memory used in the driver |
spark.executor.rdd_blocks | avg, max, min, sum | block/second | Number of persisted RDD blocks in the application's executors |
spark.executor.memory_used | avg, max, min, sum | byte/second | Amount of memory used for cached RDDs in the application's executors |
spark.executor.disk_used | avg, max, min, sum | byte/second | Amount of disk space used by persisted RDDs in the application's executors |
spark.executor.active_tasks | avg, max, min, sum | task/second | Number of active tasks in the application's executors |
spark.executor.failed_tasks | avg, max, min, sum | task/second | Number of failed tasks in the application's executors |
spark.executor.completed_tasks | avg, max, min, sum | task/second | Number of completed tasks in the application's executors |
spark.executor.total_tasks | avg, max, min, sum | task/second | Total number of tasks in the application's executors |
spark.executor.total_duration | avg, max, min, sum | fraction | Fraction of time (ms/s) spent by the application's executors executing tasks |
spark.executor.total_input_bytes | avg, max, min, sum | byte/second | Total number of input bytes in the application's executors |
spark.executor.total_shuffle_read | avg, max, min, sum | byte/second | Total number of bytes read during a shuffle in the application's executors |
spark.executor.total_shuffle_write | avg, max, min, sum | byte/second | Total number of shuffled bytes in the application's executors |
spark.executor_memory | avg, max, min, sum | byte/second | Maximum memory available for caching RDD blocks in the application's executors |
spark.rdd.num_partitions | avg, max, min, sum | partition/second | Number of persisted RDD partitions in the application |
spark.rdd.num_cached_partitions | avg, max, min, sum | partition/second | Number of in-memory cached RDD partitions in the application |
spark.rdd.memory_used | avg, max, min, sum | byte/second | Amount of memory used in the application's persisted RDDs |
spark.rdd.disk_used | avg, max, min, sum | byte/second | Amount of disk space used by persisted RDDs in the application |
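The datasources above mirror Spark's own monitoring REST API, which exposes the underlying job, stage, executor, and RDD counters per application. Below is a minimal sketch of reading the raw executor counters behind the `spark.executor.*` datasources, assuming a driver UI on the default `http://localhost:4040`; under YARN or Mesos the same `/api/v1` paths are reached through the cluster manager's proxy instead:

```python
import json
import urllib.request

# Assumed Spark UI address; adjust for your deployment.
SPARK_UI = "http://localhost:4040"

def get(path):
    """Fetch one resource from Spark's /api/v1 monitoring endpoint."""
    with urllib.request.urlopen(f"{SPARK_UI}/api/v1{path}") as resp:
        return json.load(resp)

for app in get("/applications"):
    for executor in get(f"/applications/{app['id']}/executors"):
        # Raw values behind spark.executor.active_tasks,
        # spark.executor.completed_tasks, and spark.executor.memory_used.
        print(app["id"], executor["id"],
              executor["activeTasks"],
              executor["completedTasks"],
              executor["memoryUsed"])
```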