azkaban-aplcache


Minor CSS fixes and JS improvements (#2031)

11/19/2018 4:16:16 PM

Yeni Bermudez

Commit: c5f418b

Tree: 836b910

Parents: 5e0b90b

Improve dispatch request handling of a previously submitted …

11/16/2018 11:27:44 PM

execution (#2023)

* Extract method submitFlowRunner

* Extract method createFlowRunner

* Improve dispatch request handling of a previously submitted execution

- If the execution is indeed running, return OK so that the dispatcher knows that it was successfully dispatched
- If the execution was left in some intermediate state, return an error so that the dispatcher knows to retry or to finalize the execution as failed
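
The two cases above can be sketched as a small decision function. This is only an illustration: `DispatchSketch`, `State`, `record` and `handleDispatch` are hypothetical names, not Azkaban's actual classes or methods.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: none of these names come from the Azkaban code base.
final class DispatchSketch {

  enum State { PREPARING, RUNNING }

  // Executions this (imaginary) executor already knows about, keyed by execution id.
  private final Map<Integer, State> known = new ConcurrentHashMap<>();

  // Lets a caller simulate an execution that was left in some earlier state.
  void record(int execId, State state) {
    known.put(execId, state);
  }

  /**
   * Returns "OK" when the dispatcher may treat the execution as dispatched.
   * Throws when the execution was submitted before but is not running, so the
   * dispatcher knows to retry or to finalize the execution as failed.
   */
  String handleDispatch(int execId) {
    // A real executor would create and submit a flow runner here for new executions.
    State previous = known.putIfAbsent(execId, State.RUNNING);
    if (previous == null || previous == State.RUNNING) {
      return "OK";
    }
    throw new IllegalStateException(
        "Execution " + execId + " was submitted earlier but is in state " + previous);
  }
}
```

A repeated dispatch of a running execution returns "OK" again, while a repeated dispatch of an execution recorded as `PREPARING` throws.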

Juho Autio

Commit: 5e0b90b

Tree: 61e431d

Parents: bcbb639

Refactor hadoop token fetch logic follow-up (#2028) This …

11/15/2018 11:45:28 PM

PR:
1. adds more logging and standardizes the existing logging in each token prefetching method, so that we know which service's prefetch is stuck.

2. removes "synchronized" from doPrefetch (HadoopSecurityManager_H_2_0#doPrefetch). The current design makes it hard to debug which token service a job is stuck on while fetching a token. Since HadoopSecurityManager_H_2_0 is shared by all jobs in the executor, if one job is stuck fetching a token from a problematic token service, all other jobs are blocked from entering this synchronized method. It is impossible to infer from the job logs which token service the jobs are stuck on, as they are just waiting for one job to finish fetching its token.
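
A minimal sketch of the effect of dropping "synchronized", assuming a shared manager object and two jobs. `PrefetchSketch`, the service names and the latch that simulates a stuck token service are all made up for illustration; this is not the real HadoopSecurityManager_H_2_0 code.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for a security manager shared by all jobs in one executor.
final class PrefetchSketch {

  // Simulates a token service that hangs until it is released.
  private final CountDownLatch stuckService = new CountDownLatch(1);

  // Not synchronized: a job that hangs in here does not block other jobs,
  // and each job's log shows which service it is waiting on.
  String doPrefetch(String jobId, String service) throws InterruptedException {
    System.out.println(jobId + ": fetching token from " + service);
    if (service.equals("stuck-service")) {
      stuckService.await();
    }
    System.out.println(jobId + ": fetched token from " + service);
    return "token-for-" + jobId + "-from-" + service;
  }

  void releaseStuckService() {
    stuckService.countDown();
  }

  // Returns true if job-2 gets its token while job-1 is still stuck.
  static boolean healthyJobFinishesWhileOtherIsStuck() {
    PrefetchSketch manager = new PrefetchSketch();
    ExecutorService pool = Executors.newFixedThreadPool(2);
    try {
      Callable<String> stuckJob = () -> manager.doPrefetch("job-1", "stuck-service");
      Callable<String> healthyJob = () -> manager.doPrefetch("job-2", "healthy-service");
      Future<String> stuck = pool.submit(stuckJob);
      Future<String> healthy = pool.submit(healthyJob);
      String token = healthy.get(5, TimeUnit.SECONDS);
      boolean otherStillStuck = !stuck.isDone();
      manager.releaseStuckService();
      stuck.get(5, TimeUnit.SECONDS);
      return otherStillStuck && token.equals("token-for-job-2-from-healthy-service");
    } catch (Exception e) {
      throw new IllegalStateException(e);
    } finally {
      pool.shutdownNow();
    }
  }
}
```

If `doPrefetch` were declared `synchronized` and job-1 entered it first, job-2 would wait for the lock without logging anything, which is the debugging problem described above.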

Cheng Ren

Commit: bcbb639

Tree: 06cc412

Parents: 106d177

Refactor hadoop token fetch logic (#2027) 1. apply "save …

11/14/2018 1:26:42 AM

actions" plugin -- format changes only.
2. separate the different token prefetching procedures into individual methods (this is done with IntelliJ's Extract Method refactoring: https://www.jetbrains.com/help/idea/extract-method.html).


Follow-up -- to be done in next PR:

1. more logging in each token prefetching method, so that we know which service's prefetch is stuck.
2. see if removing "synchronized" from doPrefetch (HadoopSecurityManager_H_2_0#doPrefetch) is feasible.

Cheng Ren

Commit: 106d177

Tree: 6c87c2c

Parents: 1d251bc

Project dir cache enhancement (#2017) a race condition scenarios …

11/13/2018 10:15:19 PM

could happen when multiple Azkaban executor processes are running:
It's possible for two Azkaban executor processes to perform deletion in the same Azkaban project dir even when one executor is inactive. E.g., a flow can be run on any executor by using the useExecutor label.
If so, the in-memory list of Azkaban project dirs (installedProjects) kept by the Azkaban executor process will be out of sync with what's on the disk.

Another case is a race condition where one executor process is deleting a project dir while another executor process is creating an execution dir based on that project dir.

This PR:

1. removes installedProjects from the executor. Every time a project needs to be downloaded, every project dir is scanned and the total disk usage is summed to decide whether purging is needed. This could take tens of seconds when the number of project dirs is >= 5000, but only a few seconds with inode cache.

2. makes the project dir cleanup (deleteProjectDirsIfNecessary) synchronized, since the method is a check-then-act process that is vulnerable to race conditions when multiple threads are doing deletion. An alternative is to synchronize on an interned string of project id + project version (https://stackoverflow.com/questions/133988/synchronizing-on-string-objects-in-java); however, this is not that elegant, as the linked post points out. Synchronization at the object level makes sense given that flow setup is a low-frequency operation in most cases (<= 5 ops/min in our production environment).

3. creates another metadata file holding the file count when a project dir is created. Its purpose is to address the race condition where one executor process is deleting a project dir while another executor process is creating an execution dir based on that project dir. A sanity check on the file count will be run against the created execution dir. If the execution dir's file count is not the same as the base project dir's, the flow setup fails and the Azkaban web server dispatches it again.

Note that even with this fix, there could still be race conditions. E.g., when two executor processes are calling ProjectCacheDirCleaner#deleteProjectDirsIfNecessary, one might delete a dir while the other is loading the same dir.
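
A rough sketch of a synchronized check-then-act cleanup, under stated assumptions: `DirCleanupSketch`, the byte limit and the eviction order (by directory name) are invented for illustration and are not taken from the actual ProjectCacheDirCleaner.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical cleaner; names and eviction policy are assumptions.
final class DirCleanupSketch {
  private final Path root;
  private final long maxBytes;

  DirCleanupSketch(Path root, long maxBytes) {
    this.root = root;
    this.maxBytes = maxBytes;
  }

  // Sums the sizes of all regular files below dir.
  static long sizeOf(Path dir) {
    try (Stream<Path> files = Files.walk(dir)) {
      return files.filter(Files::isRegularFile).mapToLong(p -> p.toFile().length()).sum();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  // Deletes children before parents by walking in reverse path order.
  static void deleteRecursively(Path dir) {
    try (Stream<Path> files = Files.walk(dir)) {
      files.sorted(Comparator.reverseOrder()).forEach(p -> p.toFile().delete());
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  // synchronized: the scan (check) and the deletions (act) run as one step
  // for all threads of this process. It does not protect against other processes.
  synchronized List<String> deleteProjectDirsIfNecessary() {
    List<Path> dirs;
    try (Stream<Path> children = Files.list(root)) {
      dirs = children.filter(Files::isDirectory)
          .sorted(Comparator.comparing((Path p) -> p.getFileName().toString()))
          .collect(Collectors.toList());
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    long total = dirs.stream().mapToLong(DirCleanupSketch::sizeOf).sum();
    List<String> deleted = new ArrayList<>();
    for (Path dir : dirs) {
      if (total <= maxBytes) {
        break;
      }
      total -= sizeOf(dir);
      deleteRecursively(dir);
      deleted.add(dir.getFileName().toString());
    }
    return deleted;
  }

  // Two 10-byte project dirs with a 15-byte limit: one dir has to go.
  static List<String> demo() {
    try {
      Path root = Files.createTempDirectory("projects");
      for (String name : new String[] {"1.1", "2.1"}) {
        Path dir = Files.createDirectory(root.resolve(name));
        Files.write(dir.resolve("job.zip"), new byte[10]);
      }
      List<String> deleted = new DirCleanupSketch(root, 15).deleteProjectDirsIfNecessary();
      deleteRecursively(root);
      return deleted;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

The lock only serializes threads inside one JVM, which is why the note above still applies when two executor processes run the cleanup at the same time.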

A potential long term fix: #2020

Follow-up:
add the file count sanity check mentioned above.
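
The file count sanity check could look roughly like the sketch below. The metadata file name, the class `FileCountCheckSketch` and its methods are assumptions made for this example; the PR text does not specify them.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

// Hypothetical helper; the metadata file name is an assumption.
final class FileCountCheckSketch {
  static final String COUNT_FILE = "file_count.meta";

  // Counts regular files below dir, ignoring the metadata file itself.
  static long countFiles(Path dir) {
    try (Stream<Path> files = Files.walk(dir)) {
      return files.filter(Files::isRegularFile)
          .filter(p -> !p.getFileName().toString().equals(COUNT_FILE))
          .count();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  // Called once, right after the project dir has been created.
  static void writeFileCount(Path projectDir) {
    try {
      byte[] content = Long.toString(countFiles(projectDir)).getBytes(StandardCharsets.UTF_8);
      Files.write(projectDir.resolve(COUNT_FILE), content);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  // Returns false when the execution dir has a different file count than the
  // project dir recorded; the caller would then fail the flow setup.
  static boolean matchesProject(Path projectDir, Path executionDir) {
    try {
      byte[] content = Files.readAllBytes(projectDir.resolve(COUNT_FILE));
      long expected = Long.parseLong(new String(content, StandardCharsets.UTF_8).trim());
      return expected == countFiles(executionDir);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  // Leaves its small temp dirs behind for brevity.
  static boolean demo() {
    try {
      Path project = Files.createTempDirectory("project");
      Files.write(project.resolve("a.job"), new byte[1]);
      Files.write(project.resolve("b.job"), new byte[1]);
      writeFileCount(project);

      Path execution = Files.createTempDirectory("execution");
      Files.copy(project.resolve("a.job"), execution.resolve("a.job"));
      // b.job is missing, as if the project dir was being deleted concurrently.
      boolean incompleteRejected = !matchesProject(project, execution);
      Files.copy(project.resolve("b.job"), execution.resolve("b.job"));
      boolean completeAccepted = matchesProject(project, execution);
      return incompleteRejected && completeAccepted;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```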

Cheng Ren

Commit: 1d251bc

Tree: 6719d9b

Parents: 5044552

New 'expand/collapse all flows' menu options on flow views …

11/13/2018 3:52:43 PM

(#2019)

* New ‘Expand/Collapse all Flows’ options on Flow, Flow Execution and Schedule/Execute views

Prevents users from having to unfold deeply nested flows by clicking one by one.
This improvement will allow users to browse flows and enable or disable job executions more easily.

* Fix according to review comments

* Remaining fixes according to review comments

Yeni Bermudez

Commit: 5044552

Tree: d58f390

Parents: d399897

Exclude condition in json response when condition is null. …

11/9/2018 10:05:21 PM

3.61.0

(#2022)

Jamie Sun

Commit: d399897

Tree: e3e991e

Parents: d33e82e

Finalize execution if executor doesn't exist (#2016) * Finalize …

11/6/2018 7:29:42 PM

execution if executor doesn't exist

* Fix according to review comments

* Swap missing executor warn messages according to review comment

Juho Autio

Commit: d33e82e

Tree: 1a962fe

Parents: cf940c6

Improve error message for a cleaned project version (#1993) 1st …

11/5/2018 5:21:16 PM

commit shows the problem:
When trying to execute a cleaned project version, an exception is thrown that says that hash code comparison failed.

2nd commit improves the error message shown when trying to read a cleaned (deleted) version. It also fails faster (it doesn't try to generate a hash code from 0 chunks).

Move project.version.retention to Constants.java.
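
The fail-fast behaviour described for the 2nd commit can be illustrated as follows. This is a sketch under assumptions: `CleanedVersionSketch`, its method names, the wording of the messages and the use of MD5 are chosen for the example and are not confirmed by the commit message.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;
import java.util.List;

// Hypothetical verifier; the hash algorithm and messages are assumptions.
final class CleanedVersionSketch {

  static byte[] md5Of(List<byte[]> chunks) {
    try {
      MessageDigest md5 = MessageDigest.getInstance("MD5");
      for (byte[] chunk : chunks) {
        md5.update(chunk);
      }
      return md5.digest();
    } catch (NoSuchAlgorithmException e) {
      throw new IllegalStateException(e);
    }
  }

  static void verify(int projectId, int version, List<byte[]> chunks, byte[] expectedHash) {
    // Fail fast with a clear message instead of hashing nothing and then
    // reporting a confusing hash mismatch.
    if (chunks.isEmpty()) {
      throw new IllegalStateException("Project " + projectId + " version " + version
          + " has no stored chunks. The version may have been cleaned up.");
    }
    if (!Arrays.equals(md5Of(chunks), expectedHash)) {
      throw new IllegalStateException("Hash comparison failed for project " + projectId
          + " version " + version);
    }
  }
}
```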

Juho Autio

Commit: cf940c6

Tree: d3455fb

Parents: 69d2de2

Don't log an ERROR for skipped scheduled executions (#2000) If …

11/2/2018 6:03:38 PM

a flow is scheduled with concurrentOption=skip, it's perfectly normal that triggering of a schedule is skipped. This PR changes such ERROR lines in the server log to INFO level.

In general, in my opinion the ERROR level should only be used for platform errors, i.e. when Azkaban fails to do something that it promises to be able to do. If this rule holds, it will be easier to monitor that Azkaban is working correctly by checking that the server logs don't contain errors.

Before:

2018/10/23 13:41:14.337 +0300 INFO [ExecuteFlowAction] Invoking flow test-project.test-flow
2018/10/23 13:41:14.338 +0300 ERROR [TriggerManager] Failed to do action Execute flow test-flow from project test-project for Trigger Id: 0, Description: Trigger from triggerLoader with trigger condition of ThresholdChecker.eval() and expire condition of EndTimeCheck_1.eval(), Execute flow test-flow from project test-project
java.lang.RuntimeException: azkaban.executor.ExecutorManagerException: Flow is already running. Skipping execution.
	at azkaban.trigger.builtin.ExecuteFlowAction.doAction(ExecuteFlowAction.java:232)
	at azkaban.trigger.TriggerManager$TriggerScannerThread.onTriggerTrigger(TriggerManager.java:363)
	at azkaban.trigger.TriggerManager$TriggerScannerThread.checkAllTriggers(TriggerManager.java:343)
	at azkaban.trigger.TriggerManager$TriggerScannerThread.run(TriggerManager.java:297)
Caused by: azkaban.executor.ExecutorManagerException: Flow is already running. Skipping execution.
	at azkaban.trigger.builtin.ExecuteFlowAction.doAction(ExecuteFlowAction.java:229)
	... 3 more

After:

2018/10/23 13:41:51.778 +0300 INFO [ExecuteFlowAction] Invoking flow test-project.test-flow
2018/10/23 13:41:51.779 +0300 INFO [TriggerManager] Skipped action [Execute flow test-flow from project test-project] for [Trigger Id: 0, Description: Trigger from triggerLoader with trigger condition of ThresholdChecker.eval() and expire condition of EndTimeCheck_1.eval(), Execute flow test-flow from project test-project] because: Flow is already running. Skipping execution.
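
The Before/After log output above corresponds to a pattern like the one sketched here. It uses java.util.logging only to stay self-contained, and `TriggerLoggingSketch`, `SkippedExecutionException` and `runAction` are hypothetical names, not the classes touched by this PR.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical sketch of "expected skip is INFO, real failure is ERROR".
final class TriggerLoggingSketch {
  private static final Logger LOG = Logger.getLogger("TriggerManager");

  enum Outcome { DONE, SKIPPED, FAILED }

  // Marker for an outcome that is normal, e.g. concurrentOption=skip.
  static final class SkippedExecutionException extends RuntimeException {
    SkippedExecutionException(String message) {
      super(message);
    }
  }

  static Outcome runAction(String description, Runnable action) {
    try {
      action.run();
      return Outcome.DONE;
    } catch (SkippedExecutionException e) {
      // Expected behaviour: one INFO line, no stack trace.
      LOG.log(Level.INFO, "Skipped action [{0}] because: {1}",
          new Object[] {description, e.getMessage()});
      return Outcome.SKIPPED;
    } catch (RuntimeException e) {
      // Platform error: keep the stack trace at the highest level.
      LOG.log(Level.SEVERE, "Failed to do action " + description, e);
      return Outcome.FAILED;
    }
  }
}
```

With this split, a monitor that alerts on error-level lines in the server log no longer fires for skipped schedules.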

Juho Autio

Commit: 69d2de2

Tree: 98491d0

Parents: 25cc0b1