In this tutorial page we describe how to execute SAMOA on top of Apache Storm. Here is an outline of what we want to do:
- Ensure that you have the necessary Storm cluster and configuration to execute SAMOA
- Ensure that you have all the SAMOA deployables for execution in the cluster
- Configure samoa-storm.properties
- Execute a SAMOA classification task
- Observe the task execution
Before we start the tutorial, please ensure that you already have a Storm cluster (preferably Storm 0.8.2) running. You can follow this tutorial to set up a Storm cluster.
You also need to install Storm on the machine where you initiate the deployment, and configure Storm with at least the following settings in ~/.storm/storm.yaml:
    ########### These MUST be filled in for a storm configuration
    nimbus.host: "<enter your nimbus host name here>"

    ## List of custom serializations
    kryo.register:
        - org.apache.samoa.learners.classifiers.trees.AttributeContentEvent: org.apache.samoa.learners.classifiers.trees.AttributeContentEvent$AttributeCEFullPrecSerializer
        - org.apache.samoa.learners.classifiers.trees.ComputeContentEvent: org.apache.samoa.learners.classifiers.trees.ComputeContentEvent$ComputeCEFullPrecSerializer
Alternatively, if you don't have a Storm cluster running, you can execute SAMOA with Storm in local mode, as explained in the section samoa-storm.properties Configuration.
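Before submitting to a cluster, it can help to confirm that the nimbus.host placeholder in the Storm configuration has actually been replaced. The following is a minimal sketch, not part of SAMOA or Storm: the function name read_nimbus_host is hypothetical, and the scan is line-based rather than a real YAML parser, so it only understands the simple top-level "key: value" form shown in the configuration.

```python
PLACEHOLDER = "<enter your nimbus host name here>"


def read_nimbus_host(yaml_text):
    """Return the nimbus.host value from storm.yaml text, or None if unset.

    Hypothetical helper: a line-based scan, not a YAML parser. It only
    handles a top-level `nimbus.host: "value"` line.
    """
    for line in yaml_text.splitlines():
        stripped = line.strip()
        # Skip comments and every line that is not the nimbus.host key.
        if stripped.startswith("#") or not stripped.startswith("nimbus.host:"):
            continue
        value = stripped.split(":", 1)[1].strip().strip('"').strip("'")
        # Treat an empty value or the untouched placeholder as "not set".
        if value and value != PLACEHOLDER:
            return value
        return None
    return None
```

To use it, read the contents of ~/.storm/storm.yaml into a string and pass it to read_nimbus_host; a None result suggests the file still needs editing.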
There are three deployables for executing SAMOA on top of Storm. They are:
- bin/samoa is the main script to execute SAMOA. You do not need to change anything in this script.
- target/SAMOA-Storm-x.x.x-SNAPSHOT.jar is the deployed jar file. x.x.x is the version number of SAMOA.
- bin/samoa-storm.properties contains deployment configurations. You need to set the parameters in this properties file correctly.
samoa-storm.properties Configuration
Currently, the properties file contains two configurations:
- samoa.storm.mode determines whether the task is executed locally (using Storm's LocalCluster) or in a Storm cluster. Use local if you want to test SAMOA and you do not have a Storm cluster for deployment. Use cluster if you want to test SAMOA on your Storm cluster.
- samoa.storm.numworker determines the number of workers that execute the SAMOA tasks in the Storm cluster. This field must be an integer less than or equal to the number of available slots in your Storm cluster. If you are using local mode, this property corresponds to the number of threads used by Storm's LocalCluster to execute your SAMOA task.
Here is an example of a complete properties file:
    # SAMOA Storm properties file
    # This file contains specific configurations for SAMOA deployment in the Storm platform
    # Note that you still need to configure Storm client in your machine,
    # including setting up Storm configuration file (~/.storm/storm.yaml) with correct settings

    # samoa.storm.mode corresponds to the execution mode of the Task in Storm
    # possible values:
    # 1. cluster: the Task will be sent into nimbus. The nimbus is configured by Storm configuration file
    # 2. local: the Task will be sent using local Storm cluster
    samoa.storm.mode=cluster

    # samoa.storm.numworker corresponds to the number of worker processes allocated in Storm cluster
    # possible values: any integer greater than 0
    samoa.storm.numworker=7
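The two properties are easy to mistype, so a quick sanity check before deployment can save a failed submission. The sketch below is an illustration only and not part of SAMOA: parse_properties and validate_samoa_storm are hypothetical helper names, and the parser handles only plain key=value lines with # comments, not the full Java properties syntax.

```python
def parse_properties(text):
    """Parse simple `key=value` lines, skipping blank lines and `#` comments."""
    props = {}
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue
        key, sep, value = stripped.partition("=")
        if sep:  # ignore lines without an '=' separator
            props[key.strip()] = value.strip()
    return props


def validate_samoa_storm(props):
    """Return a list of problems; an empty list means the values look sane."""
    problems = []
    mode = props.get("samoa.storm.mode")
    if mode not in ("local", "cluster"):
        problems.append(f"samoa.storm.mode must be 'local' or 'cluster', got {mode!r}")
    raw = props.get("samoa.storm.numworker", "")
    try:
        workers = int(raw)
    except ValueError:
        workers = 0  # missing or non-integer values fail the check below
    if workers < 1:
        problems.append(f"samoa.storm.numworker must be an integer greater than 0, got {raw!r}")
    return problems
```

This check cannot tell whether the worker count exceeds the number of available slots in your Storm cluster; that still has to be verified against the cluster itself.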
SAMOA task execution
You can execute a SAMOA task using the aforementioned bin/samoa script with the following format: bin/samoa <platform> <jar> "<task>".
- <platform> can be storm or s4. Using the storm option means you are deploying SAMOA on a Storm environment. In this configuration, the script uses the aforementioned properties file (samoa-storm.properties) to perform the deployment. Using the s4 option means you are deploying SAMOA on an Apache S4 environment. Follow this link to learn more about deploying SAMOA on Apache S4.
- <jar> is the location of the deployed jar file (SAMOA-Storm-x.x.x-SNAPSHOT.jar) in your file system. The location can be a relative or an absolute path to the jar file.
- "<task>" is the SAMOA task command line, such as ClusteringTask. This command line follows the format of Massive Online Analysis (MOA).
The complete command to execute SAMOA is:

    bin/samoa storm target/SAMOA-Storm-0.0.1-SNAPSHOT.jar "PrequentialEvaluation -d /tmp/dump.csv -i 1000000 -f 100000 -l (org.apache.samoa.learners.classifiers.trees.VerticalHoeffdingTree -p 4) -s (org.apache.samoa.moa.streams.generators.RandomTreeGenerator -c 2 -o 10 -u 10)"
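Because the task string contains spaces and parentheses, quoting mistakes are a common source of errors when the command is launched from another program. The following sketch shows one way to assemble the three arguments safely; build_samoa_command is a hypothetical helper, not part of SAMOA, and the script path defaults to the bin/samoa location used in this tutorial.

```python
import shlex


def build_samoa_command(platform, jar, task, script="bin/samoa"):
    """Return the argument list for `bin/samoa <platform> <jar> "<task>"`."""
    if platform not in ("storm", "s4"):
        raise ValueError(f"unknown platform: {platform!r}")
    # Keeping the task as one list element keeps it a single argument,
    # so its spaces and parentheses need no manual escaping.
    return [script, platform, jar, task]


task = (
    "PrequentialEvaluation -d /tmp/dump.csv -i 1000000 -f 100000 "
    "-l (org.apache.samoa.learners.classifiers.trees.VerticalHoeffdingTree -p 4) "
    "-s (org.apache.samoa.moa.streams.generators.RandomTreeGenerator -c 2 -o 10 -u 10)"
)
argv = build_samoa_command("storm", "target/SAMOA-Storm-0.0.1-SNAPSHOT.jar", task)

# shlex.join renders the list as a shell command line you can copy and paste.
print(shlex.join(argv))
```

To actually launch the task, the same list can be passed to subprocess.run(argv, check=True) from the SAMOA project directory.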
Observing task execution
There are two ways to observe the task execution: using the Storm UI and monitoring the dump file of the SAMOA task. Note that the dump file will be created on the cluster if you are executing your task in cluster mode.
Using Storm UI
Go to the web address of Storm UI and check whether the SAMOA task executes as intended. Use this UI to kill the associated Storm topology if necessary.
Monitoring the dump file
Several tasks have options to specify a dump file, which is a file that represents the task output. In our example, the PrequentialEvaluation task has the -d option, which specifies the path to the dump file. Since Storm performs the allocation of Storm tasks, you should point the dump file to a file on a shared filesystem if you want to access it from the machine submitting the task.
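Once the dump file is reachable from your machine, you can poll it to follow the task's progress. The sketch below is a hypothetical helper, not part of SAMOA. It assumes the dump file is CSV (as the .csv path in the example suggests) and that its first line is a header row; the actual column layout is not described here, so the code does not interpret the values.

```python
import csv
from pathlib import Path


def latest_dump_row(dump_path):
    """Return (header, last_row) from a CSV dump file, or None if no data yet.

    Assumption: the first non-empty line is a header row.
    """
    path = Path(dump_path)
    if not path.exists():
        return None
    with path.open(newline="") as handle:
        # Drop empty rows so a trailing newline does not count as data.
        rows = [row for row in csv.reader(handle) if row]
    if len(rows) < 2:
        return None
    return rows[0], rows[-1]
```

Calling latest_dump_row("/tmp/dump.csv") in a loop with a short sleep gives a simple progress monitor. While the task is still writing, the last row may be incomplete, so treat a short row as provisional.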