azkaban-aplcache
docs/index.rst 3(+2 -1)
diff --git a/docs/index.rst b/docs/index.rst
index 7086361..a6b4374 100644
--- a/docs/index.rst
+++ b/docs/index.rst
@@ -31,6 +31,7 @@ Features
:maxdepth: 2
:caption: Contents:
+
getStarted
configuration
userManager
@@ -39,7 +40,7 @@ Features
ajaxApi
howTo
plugins
-
+ jobTypes
*****
docs/jobTypes.rst 1217(+1217 -0)
diff --git a/docs/jobTypes.rst b/docs/jobTypes.rst
new file mode 100644
index 0000000..d4fdda9
--- /dev/null
+++ b/docs/jobTypes.rst
@@ -0,0 +1,1217 @@
+.. _Jobtypes:
+
+Jobtypes
+==================================
+
+Azkaban's job type plugin design gives developers great flexibility to
+create job executors for essentially any type of system -- all managed
+and triggered by the core Azkaban workflow management.
+
+Here we provide a common set of plugins that should be useful for most
+Hadoop-related use cases, as well as sample job packages. Most of these
+job types are used in LinkedIn's production clusters, only with
+different configurations. We also give a simple guide on how to create
+new job types, either from scratch or by extending existing ones.
+
+--------------
+
+*****
+Command Job Type (built-in)
+*****
+
+The command job type is one of the basic built-in types. It runs one or
+more UNIX commands using the Java ProcessBuilder. Upon execution,
+Azkaban spawns off a process to run the command.
+
+*****
+How To Use
+*****
+
+One can run one or multiple commands within one command job. Here is
+what is needed:
+
++-----------+-------------------------+
+| Parameter | Description             |
++===========+=========================+
+| command   | The full command to run |
++-----------+-------------------------+
+
+For multiple commands, number them ``command.1``, ``command.2``, and so
+on.
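+
+As a minimal sketch (the job name and commands here are purely
+illustrative), a command job file might look like this:
+
+.. code-block:: properties
+
+    # sample.job -- runs a few shell commands in sequence
+    type=command
+    command=echo "starting work"
+    command.1=whoami
+    command.2=sleep 5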
+
+.. raw:: html
+
+ <div class="bs-callout bs-callout-info">
+
+*****
+Sample Job Package
+*****
+
+Here is a sample job package, just to show how it works:
+
+`Download
+command.zip <https://s3.amazonaws.com/azkaban2/azkaban2/samplejobs/command.zip>`__
+(Uploaded May 13, 2013)
+
+..
+ Todo:: Re-Link this
+
+.. raw:: html
+
+ </div>
+
+--------------
+
+*****
+HadoopShell Job Type
+*****
+
+In large part, this is the same as the ``Command`` type. The difference
+is its ability to talk to a Hadoop cluster securely, via Hadoop tokens.
+
+The HadoopShell job type is one of the basic built-in types. It runs one
+or more UNIX commands using the Java ProcessBuilder. Upon execution,
+Azkaban spawns off a process to run the command.
+
+
+*****
+How To Use
+*****
+
+The ``HadoopShell`` job type talks to a secure cluster via Hadoop
+tokens. The admin should specify ``obtain.binary.token=true`` if Hadoop
+cluster security is turned on. Before executing a job, Azkaban will
+obtain NameNode and JobTracker tokens for this job. These tokens will be
+written to a token file, to be picked up by the user job process during
+its execution. After the job finishes, Azkaban takes care of canceling
+these tokens with the NameNode and JobTracker.
+
+Since Azkaban only obtains the tokens at the beginning of the job run,
+and does not request new tokens or renew old ones during execution, it
+is important that the job does not run longer than the configured token
+lifetime.
+
+One can run one or multiple commands within one command job. Here is
+what is needed:
+
++-----------+-------------------------+
+| Parameter | Description             |
++===========+=========================+
+| command   | The full command to run |
++-----------+-------------------------+
+
+For multiple commands, number them ``command.1``, ``command.2``, and so
+on.
+
+Here are some common configurations that make a ``hadoopShell`` job for
+a user:
+
++-----------------------------------+-----------------------------------+
+| Parameter | Description |
++===================================+===================================+
+| type | The type name as set by the |
+| | admin, e.g. ``hadoopShell`` |
++-----------------------------------+-----------------------------------+
+| dependencies | The other jobs in the flow this |
+| | job is dependent upon. |
++-----------------------------------+-----------------------------------+
+| user.to.proxy | The Hadoop user this job should |
+| | run under. |
++-----------------------------------+-----------------------------------+
+| hadoop-inject.FOO | FOO is automatically added to the |
+| | Configuration of any Hadoop job |
+| | launched. |
++-----------------------------------+-----------------------------------+
+
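+As an illustrative sketch (the type name and proxy user are assumptions
+that depend on how the admin set up the plugin), a ``hadoopShell`` job
+file might look like this:
+
+.. code-block:: properties
+
+    # list a directory on HDFS as the proxied Hadoop user
+    type=hadoopShell
+    user.to.proxy=azkaban_test_user
+    command=hadoop fs -ls /user/azkaban_test_user
+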
+Here is what's needed and normally configured by the admin:
+
++-----------------------------------+-----------------------------------+
+| Parameter | Description |
++===================================+===================================+
+| hadoop.security.manager.class | The class that handles talking to |
+| | Hadoop clusters. |
++-----------------------------------+-----------------------------------+
+| azkaban.should.proxy | Whether Azkaban should proxy as |
+| | individual user Hadoop accounts. |
++-----------------------------------+-----------------------------------+
+| proxy.user | The Azkaban user configured with |
+| | kerberos and Hadoop, for secure |
+| | clusters. |
++-----------------------------------+-----------------------------------+
+| proxy.keytab.location | The location of the keytab file |
+| | with which Azkaban can |
+| | authenticate with Kerberos for |
+| | the specified proxy.user |
++-----------------------------------+-----------------------------------+
+| obtain.binary.token | Whether Azkaban should request |
+| | tokens. Set this to true for |
+| | secure clusters. |
++-----------------------------------+-----------------------------------+
+
+--------------
+
+*****
+Java Job Type
+*****
+
+The ``java`` job type was widely used in the original Azkaban as a
+built-in type. It is no longer a built-in type in Azkaban2, whereas
+``javaprocess`` is still built in. The main differences between the
+``java`` and ``javaprocess`` job types are:
+
+#. ``javaprocess`` runs a user program that has a "main" method, while
+   ``java`` runs an Azkaban-provided main method that invokes the user
+   program's "run" method.
+#. In the ``java`` type, Azkaban can do the setup in the provided main,
+   such as getting a Kerberos ticket or requesting Hadoop tokens,
+   whereas in ``javaprocess`` the user is responsible for everything.
+
+As a result, most users used the ``java`` type to run anything that
+talks to Hadoop clusters. That usage should now be replaced by the
+``hadoopJava`` type, which is secure, but we still keep the ``java``
+type in the plugins for backwards compatibility.
+
+*****
+How to Use
+*****
+
+Azkaban spawns a local process for the ``java`` job type that runs user
+programs. It is different from the ``javaprocess`` job type in that
+Azkaban already provides a ``main`` method, called
+``JavaJobRunnerMain``. Inside ``JavaJobRunnerMain``, it looks for the
+``run`` method, which can be specified by ``method.run`` (default is
+``run``). The user can also specify a ``cancel`` method in case they
+want to gracefully terminate the job in the middle of a run.
+
+For the most part, using the ``java`` type should be no different from
+``hadoopJava``.
+
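+Below is a rough sketch of a ``java`` job file (the class name and jar
+directory are hypothetical placeholders):
+
+.. code-block:: properties
+
+    # run the run() method of a user class in a local process
+    type=java
+    job.class=com.example.MyJavaJob
+    classpath=./lib/*
+    method.run=run
+    method.cancel=cancel
+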
+.. raw:: html
+
+ <div class="bs-callout bs-callout-info">
+
+*****
+Sample Job
+*****
+
+Please refer to the `hadoopJava type <#hadoopjava-type>`_.
+
+.. raw:: html
+
+ </div>
+
+--------------
+
+*****
+hadoopJava Type
+*****
+
+
+In large part, this is the same as the ``java`` type. The difference is
+its ability to talk to a Hadoop cluster securely, via Hadoop tokens. Most
+Hadoop job types can be created by running a hadoopJava job, such as
+Pig, Hive, etc.
+
+*****
+How To Use
+*****
+
+
+Ultimately, the ``hadoopJava`` type runs the user's Java program. Upon
+execution, it tries to construct an object that has a constructor with
+the signature ``constructor(String, Props)`` and runs its ``run``
+method. If the user wants to cancel the job, Azkaban tries the
+user-defined ``cancel`` method before doing a hard kill on that process.
+
+The ``hadoopJava`` job type talks to a secure cluster via Hadoop tokens.
+The admin should specify ``obtain.binary.token=true`` if Hadoop cluster
+security is turned on. Before executing a job, Azkaban will obtain
+NameNode and JobTracker tokens for this job. These tokens will be
+written to a token file, to be picked up by the user job process during
+its execution. After the job finishes, Azkaban takes care of canceling
+these tokens with the NameNode and JobTracker.
+
+Since Azkaban only obtains the tokens at the beginning of the job run,
+and does not request new tokens or renew old ones during execution, it
+is important that the job does not run longer than the configured token
+lifetime.
+
+If there are multiple job submissions inside the user program, the user
+should also take care not to have a single MR step cancel the tokens
+upon completion, thereby failing all other MR steps when they try to
+authenticate with Hadoop services.
+
+In many cases, it is also necessary to add code like the following to
+make sure the user program picks up the Hadoop tokens in its
+``Configuration`` or ``JobConf``:
+
+.. code-block:: java
+
+ // Suppose this is how one gets the conf
+ Configuration conf = new Configuration();
+
+ if (System.getenv("HADOOP_TOKEN_FILE_LOCATION") != null) {
+ conf.set("mapreduce.job.credentials.binary", System.getenv("HADOOP_TOKEN_FILE_LOCATION"));
+ }
+
+Here are some common configurations that make a ``hadoopJava`` job for a
+user:
+
++-----------------------------------+-----------------------------------+
+| Parameter | Description |
++===================================+===================================+
+| type | The type name as set by the |
+| | admin, e.g. ``hadoopJava`` |
++-----------------------------------+-----------------------------------+
+| job.class | The fully qualified name of the |
+| | user job class. |
++-----------------------------------+-----------------------------------+
+| classpath | The resources that should be on |
+| | the execution classpath, |
+| | accessible to the local |
+| | filesystem. |
++-----------------------------------+-----------------------------------+
+| main.args | Main arguments passed to user |
+| | program. |
++-----------------------------------+-----------------------------------+
+| dependencies | The other jobs in the flow this |
+| | job is dependent upon. |
++-----------------------------------+-----------------------------------+
+| user.to.proxy | The Hadoop user this job should |
+| | run under. |
++-----------------------------------+-----------------------------------+
+| method.run | The run method, defaults to |
+| | *run()* |
++-----------------------------------+-----------------------------------+
+| method.cancel | The cancel method, defaults to |
+| | *cancel()* |
++-----------------------------------+-----------------------------------+
+| getJobGeneratedProperties | The method user should implement |
+| | if the output properties should |
+| | be picked up and passed to the |
+| | next job. |
++-----------------------------------+-----------------------------------+
+| jvm.args | The ``-D`` for the new jvm |
+| | process |
++-----------------------------------+-----------------------------------+
+| hadoop-inject.FOO | FOO is automatically added to the |
+| | Configuration of any Hadoop job |
+| | launched. |
++-----------------------------------+-----------------------------------+
+
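+Putting the user-side parameters together, a ``hadoopJava`` job file
+might look like the following sketch (the class name and paths are
+hypothetical):
+
+.. code-block:: properties
+
+    # run a user MapReduce driver class against a secure cluster
+    type=hadoopJava
+    job.class=com.example.WordCountJob
+    classpath=./lib/*
+    main.args=/data/input /data/output
+    user.to.proxy=azkaban_test_user
+    method.run=run
+    method.cancel=cancel
+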
+Here is what's needed and normally configured by the admin:
+
++-----------------------------------+-----------------------------------+
+| Parameter | Description |
++===================================+===================================+
+| hadoop.security.manager.class | The class that handles talking to |
+| | Hadoop clusters. |
++-----------------------------------+-----------------------------------+
+| azkaban.should.proxy | Whether Azkaban should proxy as |
+| | individual user Hadoop accounts. |
++-----------------------------------+-----------------------------------+
+| proxy.user | The Azkaban user configured with |
+| | kerberos and Hadoop, for secure |
+| | clusters. |
++-----------------------------------+-----------------------------------+
+| proxy.keytab.location | The location of the keytab file |
+| | with which Azkaban can |
+| | authenticate with Kerberos for |
+| | the specified proxy.user |
++-----------------------------------+-----------------------------------+
+| hadoop.home | The Hadoop home where the jars |
+| | and conf resources are installed. |
++-----------------------------------+-----------------------------------+
+| jobtype.classpath | The items that every such job |
+| | should have on its classpath. |
++-----------------------------------+-----------------------------------+
+| jobtype.class | Should be set to |
+| | ``azkaban.jobtype.HadoopJavaJob`` |
++-----------------------------------+-----------------------------------+
+| obtain.binary.token | Whether Azkaban should request |
+| | tokens. Set this to true for |
+| | secure clusters. |
++-----------------------------------+-----------------------------------+
+
+Since Azkaban job types are named by their directory names, the admin
+should also make those names public and consistent.
+
+.. raw:: html
+
+ <div class="bs-callout bs-callout-info">
+
+*****
+Sample Job Package
+*****
+
+Here is a sample job package that does a word count. It relies on a Pig
+job to first upload the text file onto HDFS. One can also manually
+upload a file and run the word count program alone. The source code is
+in
+``azkaban-plugins/plugins/jobtype/src/azkaban/jobtype/examples/java/WordCount.java``.
+
+`Download
+java-wc.zip <https://s3.amazonaws.com/azkaban2/azkaban2/samplejobs/java-wc.zip>`__
+(Uploaded May 13, 2013)
+
+.. raw:: html
+
+ </div>
+
+--------------
+
+*****
+Pig Type
+*****
+
+
+The Pig type is for running Pig jobs. In the ``azkaban-plugins`` repo,
+we have included Pig types from pig-0.9.2 to pig-0.11.0. It is up to the
+admin to alias one of them as the ``pig`` type for Azkaban users.
+
+The Pig type uses Hadoop tokens to talk to secure Hadoop clusters.
+Therefore, individual Azkaban Pig jobs are restricted to run within the
+token's lifetime, which is set by the Hadoop admins. It is also
+important that an individual MR step inside a single Pig script doesn't
+cancel the tokens upon its completion. Otherwise, all following steps
+will fail to authenticate with the JobTracker or NameNode.
+
+Vanilla Pig types don't provide all UDF jars. It is often up to the
+admin who sets up Azkaban to provide a pre-configured Pig job type with
+company-specific UDFs registered and namespaces imported, so that users
+don't need to provide all the jars and do the configuration in their
+individual Pig job conf files.
+
+*****
+How to Use
+*****
+
+
+The Pig job runs user Pig scripts. It is important to remember, however,
+that running any Pig script might require a number of dependency
+libraries that need to be placed on the local Azkaban job classpath, or
+be registered with Pig and carried remotely, or both. By using classpath
+settings, as well as ``pig.additional.jars`` and ``udf.import.list``,
+the admin can create a Pig job type that has very different default
+behavior than the most basic "pig" type. Pig jobs talk to a secure
+cluster via Hadoop tokens. The admin should specify
+``obtain.binary.token=true`` if Hadoop cluster security is turned on.
+Before executing a job, Azkaban will obtain NameNode and JobTracker
+tokens for this job. These tokens will be written to a token file, which
+will be picked up by the user job process during its execution. For
+Hadoop 1 (``HadoopSecurityManager_H_1_0``), after the job finishes,
+Azkaban takes care of canceling these tokens with the NameNode and
+JobTracker. In Hadoop 2 (``HadoopSecurityManager_H_2_0``), due to issues
+with tokens being canceled prematurely, Azkaban does not cancel the
+tokens.
+
+Since Azkaban only obtains the tokens at the beginning of the job run,
+and does not request new tokens or renew old ones during execution, it
+is important that the job does not run longer than the configured token
+lifetime. It is also important that an individual MR step inside a
+single Pig script doesn't cancel the tokens upon its completion.
+Otherwise, all following steps will fail to authenticate with Hadoop
+services. In Hadoop 2, you may need to set
+``-Dmapreduce.job.complete.cancel.delegation.tokens=false`` to prevent
+tokens from being canceled prematurely.
+
+Here are the common configurations that make a Pig job for a *user*:
+
++-----------------------------------+-----------------------------------+
+| Parameter | Description |
++===================================+===================================+
+| type | The type name as set by the |
+| | admin, e.g. ``pig`` |
++-----------------------------------+-----------------------------------+
+| pig.script | The Pig script location. e.g. |
+| | ``src/wordcountpig.pig`` |
++-----------------------------------+-----------------------------------+
+| classpath | The resources that should be on |
+| | the execution classpath, |
+| | accessible to the local |
+| | filesystem. |
++-----------------------------------+-----------------------------------+
+| dependencies | The other jobs in the flow this |
+| | job is dependent upon. |
++-----------------------------------+-----------------------------------+
+| user.to.proxy | The hadoop user this job should |
+| | run under. |
++-----------------------------------+-----------------------------------+
+| pig.home | The Pig installation directory. |
+| | Can be used to override the |
+| | default set by Azkaban. |
++-----------------------------------+-----------------------------------+
+| param.SOME_PARAM | Equivalent to Pig's ``-param`` |
++-----------------------------------+-----------------------------------+
+| use.user.pig.jar | If true, will use the |
+| | user-provided Pig jar to launch |
+| | the job. If false, the Pig jar |
+| | provided by Azkaban will be used. |
+| | Defaults to false. |
++-----------------------------------+-----------------------------------+
+| hadoop-inject.FOO | FOO is automatically added to the |
+| | Configuration of any Hadoop job |
+| | launched. |
++-----------------------------------+-----------------------------------+
+
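+For example, a user-side ``pig`` job file might look like this sketch
+(the script path and parameter names are illustrative):
+
+.. code-block:: properties
+
+    # run a Pig script with two -param substitutions
+    type=pig
+    pig.script=src/wordcountpig.pig
+    user.to.proxy=azkaban_test_user
+    param.INPUT=/data/input
+    param.OUTPUT=/data/output
+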
+Here is what's needed and normally configured by the admin:
+
++-----------------------------------+-----------------------------------+
+| Parameter | Description |
++===================================+===================================+
+| hadoop.security.manager.class | The class that handles talking to |
+| | hadoop clusters. |
++-----------------------------------+-----------------------------------+
+| azkaban.should.proxy | Whether Azkaban should proxy as |
+| | individual user hadoop accounts. |
++-----------------------------------+-----------------------------------+
+| proxy.user | The Azkaban user configured with |
+| | kerberos and hadoop, for secure |
+| | clusters. |
++-----------------------------------+-----------------------------------+
+| proxy.keytab.location | The location of the keytab file |
+| | with which Azkaban can |
+| | authenticate with Kerberos for |
+| | the specified proxy.user |
++-----------------------------------+-----------------------------------+
+| hadoop.home | The hadoop home where the jars |
+| | and conf resources are installed. |
++-----------------------------------+-----------------------------------+
+| jobtype.classpath | The items that every such job |
+| | should have on its classpath. |
++-----------------------------------+-----------------------------------+
+| jobtype.class | Should be set to |
+| | ``azkaban.jobtype.HadoopJavaJob`` |
++-----------------------------------+-----------------------------------+
+| obtain.binary.token | Whether Azkaban should request |
+| | tokens. Set this to true for |
+| | secure clusters. |
++-----------------------------------+-----------------------------------+
+
+Dumping MapReduce Counters: this is useful when a Pig script uses UDFs,
+which may add a few custom MapReduce counters.
+
++-----------------------------------+-----------------------------------+
+| Parameter | Description |
++===================================+===================================+
+| pig.dump.hadoopCounter            | Setting this parameter to true    |
+|                                   | triggers the dumping of MapReduce |
+|                                   | counters for each MapReduce job   |
+|                                   | generated by the Pig script.      |
++-----------------------------------+-----------------------------------+
+
+Since Pig jobs are essentially Java programs, the configurations for
+Java jobs can also be set.
+
+Since Azkaban job types are named by their directory names, the admin
+should also make those names public and consistent. For example, while
+there are multiple versions of Pig job types, the admin can link one of
+them as ``pig`` for the default Pig type. Experimental Pig versions can
+be tested in parallel under a different name and promoted to the default
+Pig type once proven stable. At LinkedIn, we also provide Pig job types
+that have a number of useful UDF libraries, including DataFu and
+LinkedIn-specific ones, pre-registered and imported, so that in most
+cases users only need Pig scripts in their Azkaban job packages.
+
+.. raw:: html
+
+ <div class="bs-callout bs-callout-info">
+
+*****
+Sample Job Package
+*****
+
+
+Here is a sample job package that does word count. It assumes you have
+hadoop installed and gets some dependency jars from ``$HADOOP_HOME``:
+
+`Download
+pig-wc.zip <https://s3.amazonaws.com/azkaban2/azkaban2/samplejobs/pig-wc.zip>`__
+(Uploaded May 13, 2013)
+
+.. raw:: html
+
+ </div>
+
+--------------
+
+*****
+Hive Type
+*****
+
+The ``hive`` type is for running Hive jobs. In the
+`azkaban-plugins <https://github.com/azkaban/azkaban-plugins>`__ repo,
+we have included a hive type based on hive-0.8.1. It should work for
+higher Hive versions as well. It is up to the admin to alias one of them
+as the ``hive`` type for Azkaban users.
+
+The ``hive`` type uses Hadoop tokens to talk to secure Hadoop clusters.
+Therefore, individual Azkaban Hive jobs are restricted to run within the
+token's lifetime, which is set by the Hadoop admins. It is also
+important that an individual MR step inside a single Hive query doesn't
+cancel the tokens upon its completion. Otherwise, all following steps
+will fail to authenticate with the JobTracker or NameNode.
+
+*****
+How to Use
+*****
+
+The Hive job runs user Hive queries. The Hive job type talks to a secure
+cluster via Hadoop tokens. The admin should specify
+``obtain.binary.token=true`` if Hadoop cluster security is turned on.
+Before executing a job, Azkaban will obtain NameNode and JobTracker
+tokens for this job. These tokens will be written to a token file, which
+will be picked up by the user job process during its execution. After
+the job finishes, Azkaban takes care of canceling these tokens with the
+NameNode and JobTracker.
+
+Since Azkaban only obtains the tokens at the beginning of the job run,
+and does not request new tokens or renew old ones during execution, it
+is important that the job does not run longer than the configured token
+lifetime. It is also important that an individual MR step inside a
+single Hive query doesn't cancel the tokens upon its completion.
+Otherwise, all following steps will fail to authenticate with Hadoop
+services.
+
+Here are the common configurations that make a ``hive`` job for a
+single-line Hive query:
+
++-----------------+--------------------------------------------------+
+| Parameter | Description |
++=================+==================================================+
+| type | The type name as set by the admin, e.g. ``hive`` |
++-----------------+--------------------------------------------------+
+| azk.hive.action | use ``execute.query`` |
++-----------------+--------------------------------------------------+
+| hive.query | Used for single line hive query. |
++-----------------+--------------------------------------------------+
+| user.to.proxy | The hadoop user this job should run under. |
++-----------------+--------------------------------------------------+
+
+Specify these for a multi-line Hive query:
+
++-----------------+-------------------------------------------------------+
+| Parameter | Description |
++=================+=======================================================+
+| type | The type name as set by the admin, e.g. ``hive`` |
++-----------------+-------------------------------------------------------+
+| azk.hive.action | use ``execute.query`` |
++-----------------+-------------------------------------------------------+
+| hive.query.01 | fill in the individual hive queries, starting from 01 |
++-----------------+-------------------------------------------------------+
+| user.to.proxy | The Hadoop user this job should run under. |
++-----------------+-------------------------------------------------------+
+
+Specify these for query from a file:
+
++-----------------+--------------------------------------------------+
+| Parameter | Description |
++=================+==================================================+
+| type | The type name as set by the admin, e.g. ``hive`` |
++-----------------+--------------------------------------------------+
+| azk.hive.action | use ``execute.query`` |
++-----------------+--------------------------------------------------+
+| hive.query.file | location of the query file |
++-----------------+--------------------------------------------------+
+| user.to.proxy | The Hadoop user this job should run under. |
++-----------------+--------------------------------------------------+
+
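+For instance, a single-line-query ``hive`` job file might look like the
+following sketch (the query and table name are just examples):
+
+.. code-block:: properties
+
+    # run one Hive query through the hive job type
+    type=hive
+    azk.hive.action=execute.query
+    hive.query=SELECT COUNT(*) FROM my_table
+    user.to.proxy=azkaban_test_user
+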
+Here is what's needed and normally configured by the admin. The
+following properties go into ``private.properties``:
+
++-----------------------------------+-----------------------------------+
+| Parameter | Description |
++===================================+===================================+
+| hadoop.security.manager.class | The class that handles talking to |
+| | hadoop clusters. |
++-----------------------------------+-----------------------------------+
+| azkaban.should.proxy | Whether Azkaban should proxy as |
+| | individual user hadoop accounts. |
++-----------------------------------+-----------------------------------+
+| proxy.user | The Azkaban user configured with |
+| | kerberos and hadoop, for secure |
+| | clusters. |
++-----------------------------------+-----------------------------------+
+| proxy.keytab.location | The location of the keytab file |
+| | with which Azkaban can |
+| | authenticate with Kerberos for |
+| | the specified proxy.user |
++-----------------------------------+-----------------------------------+
+| hadoop.home | The hadoop home where the jars |
+| | and conf resources are installed. |
++-----------------------------------+-----------------------------------+
+| jobtype.classpath | The items that every such job |
+| | should have on its classpath. |
++-----------------------------------+-----------------------------------+
+| jobtype.class | Should be set to |
+| | ``azkaban.jobtype.HadoopJavaJob`` |
++-----------------------------------+-----------------------------------+
+| obtain.binary.token | Whether Azkaban should request |
+| | tokens. Set this to true for |
+| | secure clusters. |
++-----------------------------------+-----------------------------------+
+| hive.aux.jars.path | Where to find auxiliary library |
+| | jars |
++-----------------------------------+-----------------------------------+
+| env.HADOOP_HOME | ``$HADOOP_HOME`` |
++-----------------------------------+-----------------------------------+
+| env.HIVE_HOME | ``$HIVE_HOME`` |
++-----------------------------------+-----------------------------------+
+| env.HIVE_AUX_JARS_PATH | ``${hive.aux.jars.path}`` |
++-----------------------------------+-----------------------------------+
+| hive.home | ``$HIVE_HOME`` |
++-----------------------------------+-----------------------------------+
+| hive.classpath.items | Those that needs to be on hive |
+| | classpath, include the conf |
+| | directory |
++-----------------------------------+-----------------------------------+
+
+These go into ``plugin.properties``:
+
++-----------------------------------+-----------------------------------+
+| Parameter | Description |
++===================================+===================================+
+| job.class | ``azkaban.jobtype.hiveutils.azkab |
+| | an.HiveViaAzkaban`` |
++-----------------------------------+-----------------------------------+
+| hive.aux.jars.path | Where to find auxiliary library |
+| | jars |
++-----------------------------------+-----------------------------------+
+| env.HIVE_HOME | ``$HIVE_HOME`` |
++-----------------------------------+-----------------------------------+
+| env.HIVE_AUX_JARS_PATH | ``${hive.aux.jars.path}`` |
++-----------------------------------+-----------------------------------+
+| hive.home | ``$HIVE_HOME`` |
++-----------------------------------+-----------------------------------+
+| hive.jvm.args | ``-Dhive.querylog.location=.`` |
+| | ``-Dhive.exec.scratchdir=YOUR_HIV |
+| | E_SCRATCH_DIR`` |
+| | ``-Dhive.aux.jars.path=${hive.aux |
+| | .jars.path}`` |
++-----------------------------------+-----------------------------------+
+
+Since Hive jobs are essentially Java programs, the configurations for
+Java jobs can also be set.
+
+.. raw:: html
+
+ <div class="bs-callout bs-callout-info">
+
+.. rubric:: Sample Job Package
+ :name: sample-job-package-3
+
+Here is a sample job package. It assumes you have hadoop installed and
+gets some dependency jars from ``$HADOOP_HOME``. It also assumes you
+have Hive installed and configured correctly, including setting up a
+MySQL instance for Hive Metastore.
+
+`Download
+hive.zip <https://s3.amazonaws.com/azkaban2/azkaban2/samplejobs/hive.zip>`__
+(Uploaded May 13, 2013)
+
+.. raw:: html
+
+ </div>
+
+--------------
+
+.. rubric:: New Hive Jobtype
+ :name: new-hive-type
+
+We've added a new Hive jobtype whose jobtype class is
+``azkaban.jobtype.HadoopHiveJob``. The configurations have changed from
+the old Hive jobtype.
+
+Here are the configurations that a user can set:
+
++-----------------------------------+-----------------------------------+
+| Parameter | Description |
++===================================+===================================+
+| type | The type name as set by the |
+| | admin, e.g. ``hive`` |
++-----------------------------------+-----------------------------------+
+| hive.script | The relative path of your Hive |
+| | script inside your Azkaban zip |
++-----------------------------------+-----------------------------------+
+| user.to.proxy | The hadoop user this job should |
+| | run under. |
++-----------------------------------+-----------------------------------+
+| hiveconf.FOO | FOO is automatically added as a |
+| | hiveconf variable. You can |
+| | reference it in your script using |
+| | ${hiveconf:FOO}. These variables |
+| | also get added to the |
+| | configuration of any launched |
+| | Hadoop jobs. |
++-----------------------------------+-----------------------------------+
+| hivevar.FOO | FOO is automatically added as a |
+| | hivevar variable. You can |
+| | reference it in your script using |
+| | ${hivevar:FOO}. These variables |
+| | are NOT added to the |
+| | configuration of launched Hadoop |
+| | jobs. |
++-----------------------------------+-----------------------------------+
+| hadoop-inject.FOO | FOO is automatically added to the |
+| | Configuration of any Hadoop job |
+| | launched. |
++-----------------------------------+-----------------------------------+
+
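+A user-side job file for the new Hive jobtype might look like this
+sketch (the script path and variable names are illustrative):
+
+.. code-block:: properties
+
+    # run a Hive script shipped inside the Azkaban job zip
+    type=hive
+    hive.script=scripts/daily_report.q
+    user.to.proxy=azkaban_test_user
+    hivevar.DATE=2013-05-13
+    hiveconf.mapreduce.job.queuename=default
+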
+Here is what's needed and normally configured by the admin. The
+following properties go into private.properties (or into
+../commonprivate.properties):
+
++-----------------------------------+-----------------------------------+
+| Parameter | Description |
++===================================+===================================+
+| hadoop.security.manager.class | The class that handles talking to |
+| | hadoop clusters. |
++-----------------------------------+-----------------------------------+
+| azkaban.should.proxy | Whether Azkaban should proxy as |
+| | individual user hadoop accounts. |
++-----------------------------------+-----------------------------------+
+| proxy.user | The Azkaban user configured with |
+| | kerberos and hadoop, for secure |
+| | clusters. |
++-----------------------------------+-----------------------------------+
+| proxy.keytab.location | The location of the keytab file |
+| | with which Azkaban can |
+| | authenticate with Kerberos for |
+| | the specified proxy.user |
++-----------------------------------+-----------------------------------+
+| hadoop.home | The hadoop home where the jars |
+| | and conf resources are installed. |
++-----------------------------------+-----------------------------------+
+| jobtype.classpath | The items that every such job |
+| | should have on its classpath. |
++-----------------------------------+-----------------------------------+
+| jobtype.class | Should be set to |
+| | ``azkaban.jobtype.HadoopHiveJob`` |
++-----------------------------------+-----------------------------------+
+| obtain.binary.token | Whether Azkaban should request |
+| | tokens. Set this to true for |
+| | secure clusters. |
++-----------------------------------+-----------------------------------+
+| obtain.hcat.token | Whether Azkaban should request |
+| | HCatalog/Hive Metastore tokens. |
+| | If true, the |
+| | HadoopSecurityManager will |
+| | acquire an HCatalog token. |
++-----------------------------------+-----------------------------------+
+| hive.aux.jars.path | Where to find auxiliary library |
+| | jars |
++-----------------------------------+-----------------------------------+
+| hive.home | ``$HIVE_HOME`` |
++-----------------------------------+-----------------------------------+
+
+These go into plugin.properties (or into ../common.properties):
+
++-----------------------------------+-----------------------------------+
+| Parameter | Description |
++===================================+===================================+
+| hive.aux.jars.path | Where to find auxiliary library |
+| | jars |
++-----------------------------------+-----------------------------------+
+| hive.home | ``$HIVE_HOME`` |
++-----------------------------------+-----------------------------------+
+| jobtype.jvm.args | ``-Dhive.querylog.location=.`` |
+| | ``-Dhive.exec.scratchdir=YOUR_HIV |
+| | E_SCRATCH_DIR`` |
+| | ``-Dhive.aux.jars.path=${hive.aux |
+| | .jars.path}`` |
++-----------------------------------+-----------------------------------+
+
+Since Hive jobs are essentially Java programs, the configurations for
+Java jobs can also be set.
+
+--------------
+
+*****
+Common Configurations
+*****
+
+
+This section lists the configurations that are common to all job types.
+
+*****
+other_namenodes
+*****
+
+
+This job property is useful for jobs that need to read data from or
+write data to more than one Hadoop NameNode. By default, Azkaban
+requests an HDFS_DELEGATION_TOKEN on behalf of the job for the cluster
+that Azkaban is configured to run on. When this property is present,
+Azkaban will try to request an HDFS_DELEGATION_TOKEN for each of the
+specified HDFS NameNodes.
+
+The value of this property is a comma-separated list of NameNode URLs.
+
+For example: **other_namenodes=webhdfs://host1:50070,hdfs://host2:9000**
+
+*****
+HTTP Job Callback
+*****
+
+
+The purpose of this feature is to allow Azkaban to notify external
+systems via an HTTP call upon the completion of a job. The new
+properties are in the following format:
+
+- **job.notification.<status>.<sequence number>.url**
+- **job.notification.<status>.<sequence number>.method**
+- **job.notification.<status>.<sequence number>.body**
+- **job.notification.<status>.<sequence number>.headers**
+
+*****
+Supported values for **status**
+*****
+
+
+- **started**: when a job is started
+- **success**: when a job is completed successfully
+- **failure**: when a job failed
+- **completed**: when a job is either successfully completed or failed
+
+*****
+Number of callback URLs
+*****
+
+
+The maximum number of callback URLs per job is 3, so the <sequence
+number> can go from 1 to 3. If a gap is detected, only the ones before
+the gap are used.
+
+*****
+HTTP Method
+*****
+
+
+The supported methods are **GET** and **POST**. The default method is
+**GET**.
+
+*****
+Headers
+*****
+
+
+Each job callback URL can optionally specify headers in the following
+format:
+
+**job.notification.<status>.<sequence
+number>.headers**\ =<name>:<value>\r\n<name>:<value>
+
+The delimiter between headers is ``\r\n`` and the delimiter between a
+header name and its value is ``:``.
+
+The headers are applicable for both GET and POST job callback URLs.
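+
+For example, two headers would be written on one line, separated by the
+literal ``\r\n`` sequence (the header names and values here are made
+up):
+
+.. code-block:: properties
+
+    # attach a content type and a hypothetical API key header
+    job.notification.completed.1.headers=Content-type:application/json\r\nX-Api-Key:SOME_KEY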
+
+*****
+Job Context Information
+*****
+
+
+It is often desirable to include some dynamic context information about
+the job in the URL or POST request body, such as status, job name, flow
+name, execution id and project name. If the URL or POST request body
+contains any of the following tokens, they will be replaced with the
+actual values by Azkaban before the HTTP callback is made. The value of
+each token will be HTTP encoded.
+
+- **?{server}** - Azkaban host name and port
+- **?{project}**
+- **?{flow}**
+- **?{executionId}**
+- **?{job}**
+- **?{status}** - possible values are started, failed, succeeded
+
+The value of these tokens will be HTTP encoded if they are on the URL,
+but will not be encoded when they are in the HTTP body.
+
+*****
+Examples
+*****
+
+
+GET HTTP Method
+
+- job.notification.started.1.url=http://abc.com/api/v2/message?text=wow!!&job=?{job}&status=?{status}
+- job.notification.completed.1.url=http://abc.com/api/v2/message?text=wow!!&job=?{job}&status=?{status}
+- job.notification.completed.2.url=http://abc.com/api/v2/message?text=yeah!!
+
+POST HTTP Method
+
+- job.notification.started.1.url=http://abc.com/api/v1/resource
+- job.notification.started.1.method=POST
+- job.notification.started.1.body={"type":"workflow",
+ "source":"Azkaban",
+ "content":"{server}:?{project}:?{flow}:?{executionId}:?{job}:?{status}"}
+- job.notification.started.1.headers=Content-type:application/json
+
+--------------
+
+*****
+VoldemortBuildandPush Type
+*****
+
+Pushing data from Hadoop to a Voldemort store used to be done entirely
+in Java. This created lots of problems, mostly due to users having to
+keep track of jars and dependencies and keep them up-to-date. We created
+the ``VoldemortBuildandPush`` job type to address this problem. Jars and
+dependencies are now managed by admins; absolutely no jars or Java code
+are required from users.
+
+*****
+How to Use
+*****
+
+
+This is essentially a hadoopJava job, with all jars controlled by the
+admins. Users only need to provide a ``.job`` file for the job and
+specify all the parameters. The following needs to be specified:
+
++-----------------------------------+-----------------------------------+
+| Parameter | Description |
++===================================+===================================+
+| type | The type name as set by the |
+| | admin, e.g. |
+| | ``VoldemortBuildandPush`` |
++-----------------------------------+-----------------------------------+
+| push.store.name | The voldemort push store name |
++-----------------------------------+-----------------------------------+
+| push.store.owners | The push store owners |
++-----------------------------------+-----------------------------------+
+| push.store.description | Push store description |
++-----------------------------------+-----------------------------------+
+| build.input.path | Build input path on hdfs |
++-----------------------------------+-----------------------------------+
+| build.output.dir | Build output path on hdfs |
++-----------------------------------+-----------------------------------+
+| build.replication.factor | replication factor number |
++-----------------------------------+-----------------------------------+
+| user.to.proxy | The hadoop user this job should |
+| | run under. |
++-----------------------------------+-----------------------------------+
+| build.type.avro | if build and push avro data, |
+| | true, otherwise, false |
++-----------------------------------+-----------------------------------+
+| avro.key.field | if using Avro data, key field |
++-----------------------------------+-----------------------------------+
+| avro.value.field | if using Avro data, value field |
++-----------------------------------+-----------------------------------+
+
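+A user's ``.job`` file might therefore look like the following sketch
+(the store name, paths, owner, and field names are placeholders):
+
+.. code-block:: properties
+
+    # build a read-only store from Avro data on HDFS and push it
+    type=VoldemortBuildandPush
+    push.store.name=test-store
+    push.store.owners=user@example.com
+    push.store.description=Example store built from HDFS data
+    build.input.path=/data/voldemort/input
+    build.output.dir=/tmp/voldemort-build
+    build.replication.factor=2
+    user.to.proxy=azkaban_test_user
+    build.type.avro=true
+    avro.key.field=memberId
+    avro.value.field=payload
+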
+Here is what's needed and normally configured by the admin (always put
+common properties in ``commonprivate.properties`` and
+``common.properties`` for all job types).
+
+These go into ``private.properties``:
+
++-----------------------------------+-----------------------------------+
+| Parameter | Description |
++===================================+===================================+
+| hadoop.security.manager.class | The class that handles talking to |
+| | hadoop clusters. |
++-----------------------------------+-----------------------------------+
+| azkaban.should.proxy | Whether Azkaban should proxy as |
+| | individual user hadoop accounts. |
++-----------------------------------+-----------------------------------+
+| proxy.user | The Azkaban user configured with |
+| | kerberos and hadoop, for secure |
+| | clusters. |
++-----------------------------------+-----------------------------------+
+| proxy.keytab.location | The location of the keytab file |
+| | with which Azkaban can |
+| | authenticate with Kerberos for |
+| | the specified ``proxy.user`` |
++-----------------------------------+-----------------------------------+
+| hadoop.home | The hadoop home where the jars |
+| | and conf resources are installed. |
++-----------------------------------+-----------------------------------+
+| jobtype.classpath | The items that every such job |
+| | should have on its classpath. |
++-----------------------------------+-----------------------------------+
+| jobtype.class | Should be set to |
+| | ``azkaban.jobtype.HadoopJavaJob`` |
++-----------------------------------+-----------------------------------+
+| obtain.binary.token | Whether Azkaban should request |
+| | tokens. Set this to true for |
+| | secure clusters. |
++-----------------------------------+-----------------------------------+
+| azkaban.no.user.classpath | Set to true such that Azkaban |
+| | doesn't pick up user supplied |
+| | jars. |
++-----------------------------------+-----------------------------------+
+
+These go into ``plugin.properties``:
+
++-----------------------------------+-----------------------------------+
+| Parameter | Description |
++===================================+===================================+
+| job.class | ``voldemort.store.readonly.mr.azk |
+| | aban.VoldemortBuildAndPushJob`` |
++-----------------------------------+-----------------------------------+
+| voldemort.fetcher.protocol | ``webhdfs`` |
++-----------------------------------+-----------------------------------+
+| hdfs.default.classpath.dir | HDFS location for distributed |
+| | cache |
++-----------------------------------+-----------------------------------+
+| hdfs.default.classpath.dir.enable | set to true if using distributed |
+| | cache to ship dependency jars |
++-----------------------------------+-----------------------------------+
+
+.. raw:: html
+
+ <div class="bs-callout bs-callout-info">
+
+*****
+For more information
+*****
+
+
+Please refer to `Voldemort project
+site <http://project-voldemort.com/voldemort>`__ for more info.
+
+.. raw:: html
+
+ </div>
+
+--------------
+
+*****
+Create Your Own Jobtypes
+*****
+
+
+With the plugin design of Azkaban job types, it is possible to extend
+Azkaban for various system environments. You should be able to execute
+any job under the same Azkaban workflow management and scheduling.
+
+Creating new job types is often very easy. Here are several ways one can
+do it:
+
+*****
+New Types with only Configuration Changes
+*****
+
+
+One doesn't always need to write Java code to create job types for end
+users. Often, configuration changes to existing job types create
+significantly different behavior for end users. For example, at
+LinkedIn, apart from the *pig* types, we also have *pigLi* types that
+come with all the useful library jars pre-registered and imported. This
+way, normal users only need to provide their Pig scripts, and their own
+UDF jars, to Azkaban. The Pig job should run as if it were run on the
+gateway machine from the Pig grunt shell. In comparison, if users are
+required to use the basic *pig* job types, they need to package all the
+necessary jars in the Azkaban job package and do all the registering and
+importing themselves, which often poses some learning curve for new
+Pig/Azkaban users.
+
+The same practice applies to most other job types. Admins should create
+or tailor job types to their specific company needs or clusters.
+
+*****
+New Types Using Existing Job Types
+*****
+
+
+If one needs to create a different job type, a good starting point is to
+see if this can be done by using an existing job type. In Hadoop land,
+this most often means the hadoopJava type. Essentially all Hadoop jobs,
+from the most basic MapReduce job, to Pig, Hive, Crunch, etc., are Java
+programs that submit jobs to Hadoop clusters. It is usually
+straightforward to create a job type that takes user input and runs a
+hadoopJava job.
+
+For example, one can take a look at the VoldemortBuildandPush job type.
+It takes in user input such as which cluster to push to, the Voldemort
+store name, etc., and runs a hadoopJava job that does the work. For end
+users, though, this is a VoldemortBuildandPush job type with which they
+only need to fill out the ``.job`` file to push data from Hadoop to
+Voldemort stores.
+
+The same applies to the hive type.
+
+*****
+New Types by Extending Existing Ones
+*****
+
+For the most flexibility, one can always build new types by extending
+the existing ones. Azkaban uses reflection to load job types that
+implement the job interface, and tries to construct a sample object
+upon loading for basic testing. When executing a real job, Azkaban calls
+the ``run`` method to run the job, and the ``cancel`` method to cancel
+it.
+
+For new hadoop job types, it is important to use the correct
+``hadoopsecuritymanager`` class, which is also included in
+``azkaban-plugins`` repo. This class handles talking to the hadoop
+cluster, and if needed, requests tokens for job execution or for name
+node communication.
+
+For better security, tokens should be requested in the Azkaban main
+process and written to a file. Before executing user code, the job type
+should implement a wrapper that picks up the token file and sets it in
+the ``Configuration`` or ``JobConf`` object. Please refer to
+``HadoopJavaJob`` and ``HadoopPigJob`` for example usage.
+
+--------------
+
+*****
+System Statistics
+*****
+
+
+The Azkaban server maintains certain system statistics, which can be
+seen at ``http://<host>:<port>/stats``.
+
+To enable this feature, add the property
+``executor.metric.reports=true`` to ``azkaban.properties``.
+
+The property ``executor.metric.milisecinterval.default`` controls the
+interval at which the metrics are collected.
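+
+For example, the relevant ``azkaban.properties`` entries might look like
+this (the interval value is only an illustration):
+
+.. code-block:: properties
+
+    # enable metric collection and sample every 60 seconds
+    executor.metric.reports=true
+    executor.metric.milisecinterval.default=60000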
+
+*****
+Statistic Types
+*****
+
+
++----------------------+------------------------------+
+| Metric Name | Description |
++======================+==============================+
+| NumFailedFlowMetric | Number of failed flows |
++----------------------+------------------------------+
+| NumRunningFlowMetric | Number of running flows      |
++----------------------+------------------------------+
+| NumQueuedFlowMetric | Number of flows in the queue |
++----------------------+------------------------------+
+| NumRunningJobMetric | Number of running jobs |
++----------------------+------------------------------+
+| NumFailedJobMetric | Number of failed jobs |
++----------------------+------------------------------+
+
+To change the statistic collection at run time, the following options
+are available:
+
+- To change the time interval at which a specific type of statistic is
+  collected -
+  /stats?action=changeMetricInterval&metricName=NumRunningJobMetric&interval=60000
+- To change the duration for which the statistics are maintained -
+  /stats?action=changeCleaningInterval&interval=604800000
+- To change the number of data points to display -
+ /stats?action=changeEmitterPoints&numInstances=50
+- To enable the statistic collection - /stats?action=enableMetrics
+- To disable the statistic collection - /stats?action=disableMetrics
+
+--------------
+
+*****
+Reload Jobtypes
+*****
+
+When you want to make changes to your jobtype configurations or
+add/remove jobtypes, you can do so without restarting the executor
+server. You can reload all jobtype plugins as follows:
+
+.. code-block:: bash
+
+ curl http://localhost:EXEC_SERVER_PORT/executor?action=reloadJobTypePlugins
+
+