10/19/2018 8:03:49 PM
(#1976)
|
10/18/2018 8:13:39 PM
by one class that's a singleton itself, so no harm done for now. But this is how it should be, in case it gets injected into more than one instance later.
|
10/18/2018 7:40:06 PM
mode
This is to simplify the code.
This requires users to migrate to azkaban.use.multiple.executors=true if they're not already using it.
After this change Azkaban will refuse to start if the property is missing or set to false (a rough sketch of this check follows the bullets below).
* Fix wrong comment placement & other minor fixes
* Added TODO to eventually delete checkMultiExecutorMode
* Clean up usage of executor port constants
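A minimal sketch of the fail-fast behavior described above, using plain java.util.Properties for illustration; the class and method names here are assumptions, not the actual Azkaban code:

```java
import java.util.Properties;

// Sketch only: illustrates the fail-fast startup check described above.
// Class and method names are assumptions, not the actual Azkaban code.
public class MultiExecutorModeCheck {

  static final String USE_MULTIPLE_EXECUTORS = "azkaban.use.multiple.executors";

  // Throws if the property is missing or not "true", so startup is refused.
  public static void validate(final Properties props) {
    if (!Boolean.parseBoolean(props.getProperty(USE_MULTIPLE_EXECUTORS))) {
      throw new IllegalStateException(
          "azkaban.use.multiple.executors must be set to true; "
              + "single-executor mode is no longer supported.");
    }
  }
}
```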
|
10/18/2018 5:15:15 PM
(#1975)
* Create a unit test for RunningExecutionsUpdater
* Move update request code to ExecutorApiGateway
Better scope for RunningExecutionsUpdater & cleaner unit tests
* Added test updateExecutionsSucceeded() & used constant for "error" (see the test sketch below)
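A hypothetical shape of such a test, just to illustrate the approach; the real RunningExecutionsUpdater and ExecutorApiGateway APIs may differ, so stand-in types are included to keep the sketch self-contained:

```java
import static org.mockito.Mockito.mock;
import static org.mockito.Mockito.verify;

import org.junit.Test;

public class RunningExecutionsUpdaterSketchTest {

  // Minimal stand-in types so this sketch compiles on its own;
  // the real Azkaban classes have richer APIs.
  interface ExecutorApiGateway {
    void updateExecutions();
  }

  static class RunningExecutionsUpdater {
    private final ExecutorApiGateway gateway;

    RunningExecutionsUpdater(final ExecutorApiGateway gateway) {
      this.gateway = gateway;
    }

    void updateExecutions() {
      // The updater delegates the actual update request to the gateway.
      this.gateway.updateExecutions();
    }
  }

  @Test
  public void updateExecutionsSucceeded() {
    final ExecutorApiGateway gateway = mock(ExecutorApiGateway.class);
    final RunningExecutionsUpdater updater = new RunningExecutionsUpdater(gateway);

    updater.updateExecutions();

    // Mocking the gateway keeps this a pure unit test of the updater.
    verify(gateway).updateExecutions();
  }
}
```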
|
|
10/16/2018 3:28:15 PM
attempts than the number of active executors.
The configuration key azkaban.maxDispatchingErrors is thus respected without an upper cap.
Normally azkaban-web shouldn't ever fail dispatching executions to executors as long as there are active executors available. This change allows configuring a limit that is in practice high enough to keep retrying forever, until at least one responsive executor appears.
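Roughly, the give-up condition then becomes a plain comparison against the configured value; this is a sketch only, not the actual ExecutorManager code, and the names are illustrative:

```java
// Sketch only: the give-up condition for dispatch retries, no longer capped
// by the number of active executors. Names are illustrative.
public class DispatchRetryPolicy {

  // Value of azkaban.maxDispatchingErrors.
  private final int maxDispatchingErrors;

  public DispatchRetryPolicy(final int maxDispatchingErrors) {
    this.maxDispatchingErrors = maxDispatchingErrors;
  }

  // Configuring a very large value effectively means
  // "retry until a responsive executor appears".
  public boolean shouldGiveUp(final int dispatchErrorCount) {
    return dispatchErrorCount > this.maxDispatchingErrors;
  }
}
```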
NOTE: The handling of dispatch errors should be improved to distinguish between retriable & non-retriable errors.
The dispatch call itself is simple: it contains the execution id and username. I can't imagine a case where the request would be syntactically invalid/incompatible with the executor.
However, if the executor returns a response with "error" in it, ExecutorManager currently fails the dispatch and keeps retrying until the give-up condition is met. It shouldn't be like this in all cases.
For example:
If the error is about "already running" (on that executor), ExecutorManager should treat that as a successful dispatch (even though this may never actually happen in practice).
Actually, for this case I think the executor shouldn't even return an error response.
If the error reason is that the execution is not found in the DB when the executor tries to load it (how could that happen, though?), ExecutorManager should just give up dispatching.
And so on.
Anyway, manually cleaning up the Azkaban DB of problematic executions like those mentioned above can be left to the admin, if the admin chooses to configure a non-default value for maxDispatchingErrors. This PR doesn't have to deal with handling different dispatch error cases. That can be handled later.
But this is how I'd plan to do it (a rough sketch follows the list below):
If response is connection error -> keep retrying
If response is received but status is not HTTP 200 OK -> keep retrying
If response is "already running on this executor" -> treat as a success (implement this so that executor doesn't return an error in the first place)
If response is received with any other error -> give up after receiving this kind of error from all active executors
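A rough sketch of how that classification could look; this is a plan, not existing Azkaban code, and all names here are hypothetical:

```java
// Hypothetical classification of dispatch outcomes, mirroring the plan above.
enum DispatchOutcome { SUCCESS, RETRY, GIVE_UP }

final class DispatchErrorClassifier {

  private DispatchErrorClassifier() {
  }

  static DispatchOutcome classify(final boolean connectionError, final Integer httpStatus,
      final String errorMessage) {
    if (connectionError) {
      return DispatchOutcome.RETRY;          // connection error -> keep retrying
    }
    if (httpStatus == null || httpStatus != 200) {
      return DispatchOutcome.RETRY;          // non-200 response -> keep retrying
    }
    if (errorMessage == null) {
      return DispatchOutcome.SUCCESS;        // no error in the response -> dispatched
    }
    if (errorMessage.contains("already running")) {
      // Ideally the executor wouldn't return an error for this at all.
      return DispatchOutcome.SUCCESS;
    }
    // Any other error: give up once all active executors have returned it
    // (the per-executor bookkeeping is left out of this sketch).
    return DispatchOutcome.GIVE_UP;
  }
}
```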
Additional note on retrying after dispatch failure with "is already running":
I'm afraid that this can happen with the current azkaban code (remains to be verified though):
azkaban-web tries to dispatch execution 123 to executor 1
azkaban-web crashes / is killed
executor 1 has started execution 123 and it's running
azkaban-web starts again
azkaban-web fetches the queued executions from the DB
azkaban-web tries to dispatch execution 123 to the assigned executor 1
executor 1 returns an error
azkaban-web rolls back executor assignment of execution 123
azkaban-web dispatches execution 123 to executor 2
execution 123 is running on both executors 1 & 2 at the same time
The fix is also simple: don't return an error if the execution is already running on the executor (a rough executor-side sketch follows below).
However, even that fix is not bullet-proof. Azkaban-web could also fail to receive the response of a dispatch call because of a connection error, for example, and would then automatically try the next executor. One option would be to have some cooldown period after a failed dispatch attempt and to check from the DB whether the execution is running before trying to dispatch to another executor? Seems hard to get this right without adding proper locking though. Whew, I'm happy to realize that in our setup we typically have only 1 active executor at a time.
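For reference, the executor-side part of that fix could look roughly like this; sketch only, the real executor servlet and its response format differ, and the response strings here are made up:

```java
import java.util.HashSet;
import java.util.Set;

// Sketch only: answer a duplicate dispatch of an already-running execution with
// success instead of an error, so azkaban-web doesn't re-dispatch it elsewhere.
// The real executor servlet and response format differ; this is illustrative.
final class ExecuteRequestHandlerSketch {

  private final Set<Integer> runningExecIds = new HashSet<>();

  synchronized String handleExecute(final int execId) {
    if (this.runningExecIds.contains(execId)) {
      // Already running here: report success rather than an error.
      return "{\"status\":\"success\",\"execid\":" + execId + "}";
    }
    this.runningExecIds.add(execId);
    // ... start the flow runner for execId here ...
    return "{\"status\":\"success\",\"execid\":" + execId + "}";
  }
}
```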
|
10/16/2018 3:22:56 PM
remove redundant initial field values & make them final
- apply save actions plugin
|
|
10/15/2018 8:51:33 PM
1. Variable support for the reportal presto jobtype. 2. Support for presto queries with a trailing semicolon.
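The trailing-semicolon part could be handled with a simple strip before submitting the query; this is a sketch under the assumption that the jobtype passes the raw query string through, and the class and method names are made up:

```java
final class PrestoQueryUtil {

  private PrestoQueryUtil() {
  }

  // Sketch only: strip a trailing semicolon before handing the query to Presto,
  // since a trailing ';' can cause the submitted statement to be rejected.
  static String stripTrailingSemicolon(final String query) {
    final String trimmed = query.trim();
    return trimmed.endsWith(";")
        ? trimmed.substring(0, trimmed.length() - 1).trim()
        : trimmed;
  }
}
```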
|
10/15/2018 4:49:55 PM
(#1974)
|