|
11/16/2018 11:27:44 PM
execution (#2023)
* Extract method submitFlowRunner
* Extract method createFlowRunner
* Improve dispatch request handling of a previously submitted execution
- If the execution is indeed running, return OK so that the dispatcher knows that it was successfully dispatched
- If the execution was left in some intermediate state, return an error so that the dispatcher knows to retry or finalize the execution as failed
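A minimal sketch of the dispatch decision described above. The class, enum and method names are hypothetical and only illustrate the rule; they are not Azkaban's actual classes, and the sketch marks a new execution as running right away for brevity.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch; none of these names come from Azkaban itself.
class DispatchHandler {
  enum State { PREPARING, RUNNING }
  enum Response { OK, ERROR }

  // Execution id -> last known state on this executor.
  private final Map<Integer, State> submitted = new ConcurrentHashMap<>();

  // Records the state of a previously submitted execution (illustration only).
  void record(int execId, State state) {
    submitted.put(execId, state);
  }

  // Handles a dispatch request for the given execution id.
  Response dispatch(int execId) {
    State previous = submitted.putIfAbsent(execId, State.RUNNING);
    if (previous == null) {
      // Not seen before: the execution has just been submitted.
      return Response.OK;
    }
    // Seen before: OK only if it is really running. Any intermediate state
    // is reported as an error so the dispatcher can retry or finalize it.
    return previous == State.RUNNING ? Response.OK : Response.ERROR;
  }
}
```

With this rule a repeated dispatch request for a running execution is harmless, while a half-submitted one is surfaced to the dispatcher instead of being silently accepted.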
|
11/15/2018 11:45:28 PM
PR:
1. adds more logging and standardizes the existing logging in each token prefetching method, so that we know which service's prefetch is stuck.
2. removes "synchronized" from doPrefetch (HadoopSecurityManager_H_2_0#doPrefetch). The current design makes it hard to tell which token service a job is stuck fetching a token from. Since HadoopSecurityManager_H_2_0 is shared by all jobs in the executor, if one job is stuck fetching a token from a problematic token service, all other jobs are blocked from entering this synchronized method. It's impossible to infer from the job logs which token service the jobs are stuck on, as they are all just waiting for that one job to finish fetching its token.
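A rough sketch of the idea, not the actual HadoopSecurityManager_H_2_0 code: the method below is not synchronized and logs before and after each fetch, so if a fetch hangs, the last line in that job's log names the service. The class name, parameters and message format are assumptions made for illustration.

```java
import java.util.List;
import java.util.function.Consumer;

// Hypothetical sketch; the names and log format are illustrative only.
class TokenPrefetchSketch {

  // Deliberately NOT synchronized: a slow token service only delays the
  // job that called it, and other jobs can prefetch concurrently.
  static void prefetchAll(String jobId, List<String> services,
      Consumer<String> fetcher, Consumer<String> log) {
    for (String service : services) {
      // Logged before the call, so a hang is attributable to this service.
      log.accept(jobId + ": pre-fetching token from " + service);
      fetcher.accept(service);
      log.accept(jobId + ": fetched token from " + service);
    }
  }
}
```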
|
11/14/2018 1:26:42 AM
actions" plugin -- format changes only.
2. separate the different token prefetching procedures into individual methods (this is done with IntelliJ's Extract Method refactoring: https://www.jetbrains.com/help/idea/extract-method.html).
Follow-up -- to be done in next PR:
1. more logging for each token prefetching method, so that we know which service's prefetch is stuck.
2. see if removing "synchronized" for doPrefetch(HadoopSecurityManager_H_2_0#doPrefetch) is feasible.
|
11/13/2018 10:15:19 PM
could happen when multiple azkaban executor processes are running:
Two azkaban executor processes can perform deletion in the same azkaban project dir, even when one executor is inactive. E.g., a flow can be run on any executor by using the useExecutor label.
If so, the list of azkaban project dirs kept in memory (installedProjects) by the azkaban executor process will be out of sync with what's on the disk.
Another case is a race condition where one executor process is deleting a project dir while another executor process is creating an execution dir based on that project dir.
This PR
removes installedProjects from the executor. So every time a project needs to be downloaded, every project dir is scanned and the total disk usage is summed to decide whether purging is needed. This could take tens of seconds when the number of project dirs is >= 5000, but only a few seconds with the inode cache.
makes project dir cleanup (deleteProjectDirsIfNecessary) synchronized, since the method is a check-then-act process, which is vulnerable to race conditions when multiple threads are doing deletion. An alternative is to synchronize on an interned string of project id + project version (https://stackoverflow.com/questions/133988/synchronizing-on-string-objects-in-java); however, this is not that elegant, as the linked post points out. Synchronization on the object level makes sense given that flow setup is a low-frequency operation in most cases (<= 5 ops/min in our production environment).
when a project dir is created, another metadata file keeping the file count is created. Its purpose is to address the race condition where one executor process is deleting a project dir while another executor process is creating an execution dir based on that project dir. A sanity check on the file count will be conducted against the created execution dir. If the execution dir's file count is not the same as the base project dir's, then fail the flow setup and let the azkaban web server dispatch it again.
Note that even with this fix, there could still be race conditions. E.g., when two executor processes are calling ProjectCacheDirCleaner#deleteProjectDirsIfNecessary, one might delete a dir while the other is loading the same dir.
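The synchronized check-then-act cleanup could look roughly like the sketch below. It is not the real ProjectCacheDirCleaner: the class name, the constructor parameters and the deletion order are assumptions. The disk scan and the delete are passed in as functions so that every call works from a fresh scan instead of an in-memory list.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;
import java.util.function.Supplier;

// Hypothetical sketch; only the method name mirrors the one in this PR.
class ProjectDirCleanerSketch {
  private final long maxTotalBytes;
  private final Supplier<Map<String, Long>> scanDisk; // dir name -> size in bytes
  private final Consumer<String> deleteDir;

  ProjectDirCleanerSketch(long maxTotalBytes,
      Supplier<Map<String, Long>> scanDisk, Consumer<String> deleteDir) {
    this.maxTotalBytes = maxTotalBytes;
    this.scanDisk = scanDisk;
    this.deleteDir = deleteDir;
  }

  // synchronized makes the check (sum the sizes) and the act (delete dirs)
  // one atomic step for all threads of this executor process. It does not
  // protect against a second executor process working on the same dirs.
  synchronized List<String> deleteProjectDirsIfNecessary(long incomingBytes) {
    Map<String, Long> sizes = scanDisk.get(); // fresh scan on every call
    long total = incomingBytes;
    for (long size : sizes.values()) {
      total += size;
    }
    List<String> deleted = new ArrayList<>();
    // Delete in the scan's iteration order until the total fits again.
    for (Map.Entry<String, Long> dir : sizes.entrySet()) {
      if (total <= maxTotalBytes) {
        break;
      }
      deleteDir.accept(dir.getKey());
      total -= dir.getValue();
      deleted.add(dir.getKey());
    }
    return deleted;
  }
}
```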
A potential long term fix: #2020
Follow-up
add the file count sanity check mentioned above.
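One possible shape for the file count sanity check, as a sketch only. The metadata file name and the method names are made up for illustration, and the sketch assumes the count covers regular files and excludes the metadata file itself.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

// Hypothetical sketch; the file name and method names are not from Azkaban.
class FileCountCheckSketch {
  static final String COUNT_FILE = "___file_count___"; // assumed metadata file name

  // Counts regular files under dir, ignoring the metadata file itself.
  static long countFiles(Path dir) throws IOException {
    try (Stream<Path> paths = Files.walk(dir)) {
      return paths.filter(Files::isRegularFile)
          .filter(p -> !p.getFileName().toString().equals(COUNT_FILE))
          .count();
    }
  }

  // Called once when the project dir has been created.
  static void writeFileCount(Path projectDir) throws IOException {
    long count = countFiles(projectDir);
    Files.write(projectDir.resolve(COUNT_FILE),
        Long.toString(count).getBytes(StandardCharsets.UTF_8));
  }

  // Called after the execution dir has been created from the project dir.
  // A mismatch means the project dir was probably being deleted concurrently.
  static boolean matchesProjectDir(Path projectDir, Path executionDir) throws IOException {
    byte[] raw = Files.readAllBytes(projectDir.resolve(COUNT_FILE));
    long expected = Long.parseLong(new String(raw, StandardCharsets.UTF_8).trim());
    return expected == countFiles(executionDir);
  }
}
```

If matchesProjectDir returns false, flow setup would be failed so that the web server dispatches the execution again.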
|
11/13/2018 3:52:43 PM
(#2019)
* New ‘Expand/Collapse all Flows’ options on Flow, Flow Execution and Schedule/Execute views
Prevents users from having to unfold deeply nested flows by clicking one by one.
This improvement will allow users to browse flows and enable or disable job executions more easily.
* Fix according to review comments
* Remaining fixes according to review comments
|
11/9/2018 10:05:21 PM
3.61.0
(#2022)
|
11/6/2018 7:29:42 PM
execution if executor doesn't exist
* Fix according to review comments
* Swap missing executor warn messages according to review comment
|
11/5/2018 5:21:16 PM
commit shows the problem:
When trying to execute a cleaned project version, an exception is thrown that says that hash code comparison failed.
The 2nd commit improves the error message shown when trying to read a cleaned (deleted) version. It also fails faster (it doesn't try to generate a hashcode from 0 chunks).
Move project.version.retention to Constants.java.
|
11/2/2018 6:03:38 PM
a flow is scheduled with concurrentOption=skip, it's perfectly normal that triggering of a schedule is skipped. This PR changes such ERROR lines in the server log to INFO level.
On a general level, in my opinion the ERROR level should only be used for platform errors, i.e. when Azkaban fails to do something that it promises to be able to do. If this rule holds, it will be easier to monitor that Azkaban is working correctly by checking that the server logs don't contain errors.
Before:
2018/10/23 13:41:14.337 +0300 INFO [ExecuteFlowAction] Invoking flow test-project.test-flow
2018/10/23 13:41:14.338 +0300 ERROR [TriggerManager] Failed to do action Execute flow test-flow from project test-project for Trigger Id: 0, Description: Trigger from triggerLoader with trigger condition of ThresholdChecker.eval() and expire condition of EndTimeCheck_1.eval(), Execute flow test-flow from project test-project
java.lang.RuntimeException: azkaban.executor.ExecutorManagerException: Flow is already running. Skipping execution.
	at azkaban.trigger.builtin.ExecuteFlowAction.doAction(ExecuteFlowAction.java:232)
	at azkaban.trigger.TriggerManager$TriggerScannerThread.onTriggerTrigger(TriggerManager.java:363)
	at azkaban.trigger.TriggerManager$TriggerScannerThread.checkAllTriggers(TriggerManager.java:343)
	at azkaban.trigger.TriggerManager$TriggerScannerThread.run(TriggerManager.java:297)
Caused by: azkaban.executor.ExecutorManagerException: Flow is already running. Skipping execution.
	at azkaban.trigger.builtin.ExecuteFlowAction.doAction(ExecuteFlowAction.java:229)
	... 3 more
After:
2018/10/23 13:41:51.778 +0300 INFO [ExecuteFlowAction] Invoking flow test-project.test-flow
2018/10/23 13:41:51.779 +0300 INFO [TriggerManager] Skipped action [Execute flow test-flow from project test-project] for [Trigger Id: 0, Description: Trigger from triggerLoader with trigger condition of ThresholdChecker.eval() and expire condition of EndTimeCheck_1.eval(), Execute flow test-flow from project test-project] because: Flow is already running. Skipping execution.
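One way to express the level choice in code, as a sketch under assumptions rather than the actual TriggerManager change: a hypothetical ActionSkippedException marks expected skips, and java.util.logging levels stand in for the server's logger (SEVERE plays the role of ERROR).

```java
import java.util.logging.Level;

// Hypothetical sketch; the exception type and method names are illustrative.
class TriggerActionLoggingSketch {

  // Marks an outcome that is expected, e.g. concurrentOption=skip.
  static class ActionSkippedException extends RuntimeException {
    ActionSkippedException(String message) {
      super(message);
    }
  }

  // Expected skips are informational; everything else is a platform error.
  static Level levelFor(Throwable failure) {
    return failure instanceof ActionSkippedException ? Level.INFO : Level.SEVERE;
  }

  // Builds the one-line message; no stack trace is needed for a skip.
  static String describe(String action, String trigger, Throwable failure) {
    if (failure instanceof ActionSkippedException) {
      return "Skipped action [" + action + "] for [" + trigger + "] because: "
          + failure.getMessage();
    }
    return "Failed to do action " + action + " for " + trigger;
  }
}
```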
|