Apache Spark
Learn about using Sentry with Apache Spark.
The Spark Integration adds support for the Python API for Apache Spark, PySpark.
This integration is experimental and in an alpha state. The integration API may experience breaking changes in future minor versions.
The Spark driver integration is supported for Spark 2 and above.
To configure the SDK, initialize it with the integration before you create a `SparkContext` or `SparkSession`.
In addition to capturing errors, you can monitor interactions between multiple services or applications by enabling tracing. You can also collect and analyze performance profiles from real users with profiling.
```python
import sentry_sdk
from sentry_sdk.integrations.spark import SparkIntegration
from pyspark.sql import SparkSession

if __name__ == "__main__":
    sentry_sdk.init(
        dsn="https://examplePublicKey@o0.ingest.sentry.io/0",
        # Set traces_sample_rate to 1.0 to capture 100%
        # of transactions for tracing.
        traces_sample_rate=1.0,
        # Set profiles_sample_rate to 1.0 to profile 100%
        # of sampled transactions.
        # We recommend adjusting this value in production.
        profiles_sample_rate=1.0,
        integrations=[
            SparkIntegration(),
        ],
    )

    spark = SparkSession \
        .builder \
        .appName("ExampleApp") \
        .getOrCreate()

    ...
```
The spark worker integration is supported for Spark versions 2.4.x and 3.1.x.
Create a file called `sentry_daemon.py` with the following content:

`sentry_daemon.py`
```python
import sentry_sdk
from sentry_sdk.integrations.spark import SparkWorkerIntegration
import pyspark.daemon as original_daemon

if __name__ == "__main__":
    sentry_sdk.init(
        dsn="https://examplePublicKey@o0.ingest.sentry.io/0",
        # Set traces_sample_rate to 1.0 to capture 100%
        # of transactions for tracing.
        traces_sample_rate=1.0,
        # Set profiles_sample_rate to 1.0 to profile 100%
        # of sampled transactions.
        # We recommend adjusting this value in production.
        profiles_sample_rate=1.0,
        integrations=[
            SparkWorkerIntegration(),
        ],
    )

    original_daemon.manager()
```
In your `spark-submit` command, add the following configuration options so the Spark clusters can use the Sentry integration.
| Command Line Options | Parameter | Usage |
| --- | --- | --- |
| `--py-files` | `sentry_daemon.py` | Sends the `sentry_daemon.py` file to your Spark clusters |
| `--conf` | `spark.python.use.daemon=true` | Configures Spark to use a daemon to execute its Python workers |
| `--conf` | `spark.python.daemon.module=sentry_daemon` | Configures Spark to use the Sentry custom daemon |
```shell
./bin/spark-submit \
  --py-files sentry_daemon.py \
  --conf spark.python.use.daemon=true \
  --conf spark.python.daemon.module=sentry_daemon \
  example-spark-job.py
```
- You must have the Sentry Python SDK installed on all your clusters to use the Spark integration. The easiest way to do this is to run an initialization script on all your clusters:

```shell
easy_install pip
pip install --upgrade sentry-sdk
```
- In order to access certain tags (`app_name`, `application_id`), the worker integration requires the driver integration to also be active.
- The worker integration only works on UNIX-based systems due to the daemon process using signals for child management.
This integration can be set up for Google Cloud Dataproc. It's recommended that Cloud Dataproc image version 1.4 or 2.0 be used with Spark 2.4 and 3.1, respectively (as required by the worker integration).
1. Set up an initialization action to install the `sentry-sdk` on your Dataproc cluster.
2. Add the driver integration to your main Python file submitted in the job submit screen.
3. Add the `sentry_daemon.py` under Additional python files in the job submit screen. You must first upload the daemon file to a bucket to access it.
4. Add the configuration properties listed above, `spark.python.use.daemon=true` and `spark.python.daemon.module=sentry_daemon`, in the job submit screen.
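The steps above can also be sketched with the `gcloud` CLI instead of the job submit screen. This is a sketch, not part of the official instructions: the cluster name, region, bucket, and the `install-sentry-sdk.sh` initialization script below are placeholder names you would replace with your own.

```shell
# Create the cluster with an initialization action that installs the
# Sentry SDK on every node (cluster, bucket, and script names are
# hypothetical placeholders).
gcloud dataproc clusters create example-cluster \
  --region=us-central1 \
  --image-version=2.0 \
  --initialization-actions=gs://example-bucket/install-sentry-sdk.sh

# Submit the PySpark job, shipping the daemon file to the workers and
# enabling the Sentry daemon module via the configuration properties
# listed above.
gcloud dataproc jobs submit pyspark gs://example-bucket/example-spark-job.py \
  --cluster=example-cluster \
  --region=us-central1 \
  --py-files=gs://example-bucket/sentry_daemon.py \
  --properties="spark.python.use.daemon=true,spark.python.daemon.module=sentry_daemon"
```

The `--py-files` and `--properties` flags here correspond to the `--py-files` and `--conf` options of `spark-submit` shown earlier.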