Functional scheduling on a cluster

I discussed this topic a month ago with Dave Curylo on the #code channel of the F# Slack.


giuliohome [7:05 PM]
I have some code for a daily scheduled task in a Windows service:

module Timer =

    open System

    // True while a run is in progress, so that overlapping ticks are skipped.
    let mutable timer_working = false
    let log = LogManager.GetLogger("TimerF")
    let _oTimerLoop = new Timers.Timer()
    // Polling interval in seconds.
    let _iLoopTimer = 30.0
    let _oTimerLoop_Elapsed = new Timers.ElapsedEventHandler(fun _ _ ->
      if timer_working then
        log.Info("Skip timer while previous one is working")
      else
        timer_working <- true
        _oTimerLoop.Interval <- 1000.0 * _iLoopTimer
        try
            // Load the requests, fire the auto-run ones, then run whatever is queued.
            Persistence.retrieveSel(connString, log)
            |> Utility.fireAutoRun log
            |> Seq.filter (fun r -> r.RequestStatus.Equals(RequestStatuses.Queued))
            |> Seq.iter (fun r -> Utility.runSelection r log)
        with
            | exc ->
                log.Error("Scheduled Tasks Error", exc)
        timer_working <- false )

    let StartTimer() =
        _oTimerLoop.Elapsed.AddHandler( _oTimerLoop_Elapsed)
        _oTimerLoop.Start()

My problem is that the above code assumes it runs on a single application server! I’ve no idea how to rewrite it for a cluster of more than one application server…
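
Side note on the timer_working flag above: System.Timers.Timer raises Elapsed on thread-pool threads (unless a SynchronizingObject is set), so two ticks can in principle both read the flag as false before either of them sets it. Below is a minimal sketch of an atomic guard for a single process. RunGuard and TryRun are hypothetical names, not part of the original service, and this does nothing for the multi-server problem discussed in the rest of the conversation.

```fsharp
open System.Threading

/// Hypothetical helper: lets at most one caller at a time run the work.
type RunGuard() =
    // 0 = idle, 1 = a run is in progress.
    let mutable busy = 0

    /// Runs `work` and returns true if no other run is in progress;
    /// otherwise skips the work and returns false.
    member this.TryRun(work: unit -> unit) : bool =
        // Atomically flip 0 -> 1; only the caller that saw 0 proceeds.
        if Interlocked.CompareExchange(&busy, 1, 0) = 0 then
            try
                work ()
            finally
                // Always release the guard, even if the work throws.
                Interlocked.Exchange(&busy, 0) |> ignore
            true
        else
            false
```

In the handler, the body of the else branch would become the work passed to TryRun, and the "Skip timer" log line would be written when TryRun returns false.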

dave.curylo [8:06 PM]
@giuliohome can you run Consul or Zookeeper in your environment? If so, both support leadership election in client libraries, which is one way to achieve active-passive clustering.
A possibly more traditional option might be to use something like Quartz.NET, which can store the schedule in a database and make sure only one node picks up a job.
Actually, the second is probably a lot better for what you’re doing, since it already handles recurrent job scheduling, dealing with missed jobs, etc.

giuliohome [8:10 PM]
@dave.curylo excellent answer, thanks a lot. I’ll study and try the things you suggest!

dave.curylo [8:13 PM]
You’re welcome. Feel free to ask questions about any of those. I’m using all of them from F#, and I even have a little sample for ZK here:

Distributed Coordination with Zookeeper
Zookeeper is a system for coordinating applications and provides a framework for solving several problems that can arise when building applications that must be highly available, distributed, tolerant to network partitions and node failures:
Data update notifications. Imagine you have a few processes running to process some data. Whenever one process is done, it needs to let the others know it’s ready for the next process to pick it up. A rudimentary way to accomplish this would be for all…

giuliohome [8:16 PM]
Great! Of course there is the option to go with an “official” “enterprise scheduler”, but I would prefer a more modern and possibly open source approach.
Thanks again for your sample!!! Will look at it for sure :blush:

Zookeeper has to do with Hadoop… that’s where I had already heard about it…
Almost no Windows support (for production)

dave.curylo [8:22 PM]
The thing with Zookeeper vs. Consul for this is the protocol. Zookeeper has a special TCP protocol, Consul is all HTTP.
I think they do support Windows prod servers for ZK in more recent releases.
Also, Consul even goes so far as to provide commercial support options (via HashiCorp), which is sometimes a must.
Securing ZK is kind of black magic with ACLs and tunneling the protocol. With Consul, it’s an HTTP(S) service.

giuliohome [8:29 PM]
Two questions: do you think that AutoSys could do the same? What about a custom code modification, like taking a DB lock to ensure transactional, atomic execution?
Does something like this make sense? Is coordinating services across multiple servers good practice on the F# stack? I’ve only found an old MSDN article after a very quick search: https://msdn.microsoft.com/en-us/library/ms996526.aspx
Looks like the modern version is cloud coordination… Maybe Service Fabric

Someone from Haskell would mention distributed STM…
Locks, Actors, And STM In Pictures – adit.io
Aditya Bhargava’s personal blog.

dave.curylo [2:49 AM]
@giuliohome it looks like you are already connecting to a database, maybe? If so, I tend to think using that for the lock is easiest, or even using a library like Quartz.NET, which gives you scheduling of jobs on multiple machines. Using some cloud service just to coordinate that seems like a bit much to me, unless of course you can just schedule the whole job to run in the cloud, data and all. But if it has to reach back on premises, that’s a pain and probably a big point of failure.
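
To make the database-lock idea concrete, here is a minimal sketch, under stated assumptions. The in-memory dictionary is only a stand-in for a database table with a unique key on (job name, run date); tryAcquire and runIfLeader are hypothetical names. In a real deployment the TryAdd call would be an INSERT that fails on a duplicate key, so that the database decides which server wins the run.

```fsharp
open System
open System.Collections.Concurrent

// Stand-in for a table with a unique key on (jobName, runDate).
// The value records which server claimed the run.
let leases = ConcurrentDictionary<string * DateTime, string>()

/// Returns true only for the first server that claims the job for that day.
let tryAcquire (jobName: string) (runDate: DateTime) (serverName: string) : bool =
    leases.TryAdd((jobName, runDate.Date), serverName)

/// Runs the job only if this server won the claim; returns whether it ran.
let runIfLeader (jobName: string) (runDate: DateTime) (serverName: string) (job: unit -> unit) : bool =
    if tryAcquire jobName runDate serverName then
        job ()
        true
    else
        false
```

Every server can keep the same timer code and simply wrap the daily work in runIfLeader: whichever node fires first claims the day, and the others skip. This sketch does not handle a node that claims the run and then crashes halfway; that needs a status column or an expiry on the claim.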

giuliohome [2:50 AM]
Of course, all on premises

dave.curylo [2:51 AM]
Akka is a nice option; certainly you can have a cluster of actors where only one picks up the job. IIRC there is some ClusterSingleton actor that you can make, so that Akka will do its best to keep only one instance running on the cluster.
If you’re going to have an actor system for the rest of this, that’s a good way to go.

giuliohome [2:56 AM]
I guess I could go with an ultra naive solution (shame on me): I have a time interval in the config for the timer to fire (and the code already checks for a previous execution)… so putting different config on different servers to make them check one after the other… Aside from this rough workaround I wanted to discuss the thing from a correct theoretical standpoint
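
A minimal sketch of that staggered configuration, assuming each server is given a hypothetical 0-based index and the total server count in its config (neither exists in the original code). It spreads the servers evenly across one interval and aligns the ticks to the wall clock, so the servers do not need to be started at the same moment.

```fsharp
open System

/// Offset of server `index` (0-based) out of `count` servers within one interval.
let startOffset (interval: TimeSpan) (count: int) (index: int) : TimeSpan =
    TimeSpan.FromTicks(interval.Ticks * int64 index / int64 count)

/// Next wall-clock time strictly after `now` at which this server should check.
let nextFire (interval: TimeSpan) (count: int) (index: int) (now: DateTime) : DateTime =
    // Start of the interval-sized slot that contains `now`.
    let slotStart = DateTime(now.Ticks - now.Ticks % interval.Ticks, now.Kind)
    let candidate = slotStart + startOffset interval count index
    if candidate > now then candidate else candidate + interval
```

With a 30-second interval and three servers, the offsets are 0, 10 and 20 seconds. This only reduces the chance of two servers picking up the same request: it relies on reasonably synchronized clocks and on the existing status check seeing the previous server's update before the next server fires.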

dave.curylo [2:58 AM]
I think it’s important here to strike a balance between complexity of a distributed system and reliability of the job.

giuliohome [2:58 AM]
I was mentioning Akka because someone (ref. Scalaz and John Ⓐ De Goes) from FP sees it as an OOP “wrong” solution but again I completely agree with your comments above.
Thank you so much!

dave.curylo [3:00 AM]
If you have a lot of jobs and a lot of workers sort of polling a jobs table, taking locks, it gets to the point that it doesn’t really scale. Akka will scale like crazy, but it might be a complete architectural change that lands you with new problems (like cluster nodes to monitor).
Polling is one of those things that is incredibly reliable, because it survives even network partitions. So at small scale, even if it seems naive, there is nothing wrong with it.

giuliohome [3:02 AM]
I will go with the simplest solution I described above. Glad to have your positive feedback about it!