Functional scheduling on a cluster

I discussed this topic a month ago with Dave Curylo on the #code channel of the F# Slack.

giuliohome [7:05 PM]
I have code for a daily scheduled task in a Windows service:

 module Timer =

    // Guards against overlapping runs on this node only; it knows nothing
    // about other servers.
    let mutable timer_working = false
    let log = LogManager.GetLogger("TimerF")
    let _oTimerLoop = new Timers.Timer()
    // Polling interval, in seconds.
    let _iLoopTimer = 30.0
    let _oTimerLoop_Elapsed = new Timers.ElapsedEventHandler(fun _ _ ->
      if timer_working then
        log.Info("Skip timer while previous one is working")
      else
        timer_working <- true
        _oTimerLoop.Interval <- 1000.0 * _iLoopTimer
        try
            // connString is defined elsewhere in the service.
            Persistence.retrieveSel(connString, log)
            |> Utility.fireAutoRun log
            |> Seq.filter (fun r -> r.RequestStatus.Equals(RequestStatuses.Queued))
            |> Seq.iter (fun r -> Utility.runSelection r log)
        with
            | exc ->
                log.Error("Scheduled Tasks Error", exc)
        timer_working <- false)

    let StartTimer() =
        _oTimerLoop.Elapsed.AddHandler(_oTimerLoop_Elapsed)
        _oTimerLoop.Start()

My problem is that the above code assumes it runs on a single application server! I’ve no idea how to rewrite it for a cluster of more than one application server…

dave.curylo [8:06 PM]
@giuliohome can you run Consul or Zookeeper in your environment? If so, both support leadership election in client libraries, which is one way to achieve active-passive clustering.
A possibly more traditional option might be to use something like Quartz.NET, which can store the schedule in a database and make sure only one node picks up a job.
Actually, the second is probably a lot better for what you’re doing, since it already handles recurrent job scheduling, dealing with missed jobs, etc.

giuliohome [8:10 PM]
@dave.curylo excellent answer, thanks a lot. I’ll study and try the things you suggest!

dave.curylo [8:13 PM]
You’re welcome. Feel free to ask questions about any of those. I’m using all of them from F#, and even have a little sample for ZK here:

Distributed Coordination with Zookeeper
Zookeeper is a system for coordinating applications and provides a framework for solving several problems that can arise when building applications that must be highly available, distributed, tolerant to network partitions and node failures:
Data update notifications. Imagine you have a few processes running to processes some data. Whenever one process is done, it needs to let the others know it’s ready for the next process to pick it up. A rudimentary way to accomplish this would be for all Show…

giuliohome [8:16 PM]
Great! Of course there is the option to go with an “official” “enterprise scheduler” but I would prefer a more modern and possibly open source approach.
Thanks again for your sample!!! Will look at it for sure :blush:

Zookeeper has to do with Hadoop… that’s where I had already heard about it…
Almost no Windows support (for production)

dave.curylo [8:22 PM]
The thing with Zookeeper vs. Consul for this is the protocol. Zookeeper has a special TCP protocol, Consul is all HTTP.
I think they do support Windows prod servers for ZK in more recent releases.
Also, Consul actually even goes so far as to provide commercial support options (via Hashicorp) which is sometimes a must.
Securing ZK is kind of black magic with ACLs and tunneling the protocol. With Consul, it’s an HTTP(S) service.

giuliohome [8:29 PM]
Two questions: do you think that AutoSys could do the same? What about a custom code modification, like putting a db lock in place to ensure transactional, atomic execution?
Does something like this make sense? Is coordinating services across multiple servers a good practice in the F# technology stack? I’ve only found an old MSDN article after some very quick googling: https://msdn.microsoft.com/en-us/library/ms996526.aspx
Looks like the modern version is cloud coordination… Maybe Service Fabric

Someone from the Haskell world would mention distributed STM…
Locks, Actors, And Stm In Pictures – adit.io
Aditya Bhargava’s personal blog.

dave.curylo [2:49 AM]
@giuliohome it looks like you are already connecting to a database, maybe? If so, I tend to think using that for the lock is easiest, or even using a library like Quartz.NET, which gives you scheduling of jobs on multiple machines. Using some cloud service just to coordinate that seems like a bit much to me, unless of course you can just schedule the whole job to run in the cloud, data and all. But if it has to reach back on premise, that’s a pain and probably a big point of failure.
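
A minimal sketch of the database-lock idea mentioned here, written as a lease with an expiry time. It is only an illustration under stated assumptions: `Lease`, `LeaseStore` and `runIfLeader` are hypothetical names, and the in-memory store stands in for a single row of the shared database, where the same check would have to be one atomic conditional UPDATE.

```fsharp
open System

// Hypothetical lease row: in a real service this would be one row in the
// shared database that every application server can reach.
type Lease = { Owner: string; ExpiresAt: DateTime }

// In-memory stand-in for that row, so the sketch runs without a database.
type LeaseStore() =
    let gate = obj ()
    let mutable current : Lease option = None

    /// Grants the lease when it is free, expired, or already held by `node`.
    member this.TryAcquire(node: string, now: DateTime, ttl: TimeSpan) =
        lock gate (fun () ->
            match current with
            | Some l when l.Owner <> node && l.ExpiresAt > now -> false
            | _ ->
                current <- Some { Owner = node; ExpiresAt = now + ttl }
                true)

// Each timer tick would call this before running the queued selections.
// Returns true when this node held the lease and ran the job.
let runIfLeader (store: LeaseStore) (node: string) (now: DateTime) (job: unit -> unit) =
    if store.TryAcquire(node, now, TimeSpan.FromSeconds 60.0) then
        job ()
        true
    else
        false

// Usage: two servers compete for the same lease.
let store = LeaseStore()
let t0 = DateTime(2020, 1, 1, 9, 0, 0)
let ranA = runIfLeader store "server-a" t0 ignore                     // lease is free
let ranB = runIfLeader store "server-b" (t0.AddSeconds 10.0) ignore   // held by server-a
let ranBLater = runIfLeader store "server-b" (t0.AddSeconds 90.0) ignore // lease expired
```

The expiry is what gives failover: if the server holding the lease dies, another one takes over once the 60 seconds (an arbitrary value chosen for this sketch) have passed.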

giuliohome [2:50 AM]
Of course, all on premises

dave.curylo [2:51 AM]
Akka is a nice option, certainly you can have a cluster of actors and only one picks up the job. IIRC there is some ClusterSingleton actor that you can make that Akka will do its best to keep only one instance running on the cluster.
If you’re going to have an actor system for the rest of this, that’s a good way to go.

giuliohome [2:56 AM]
I guess I could go with an ultra naive solution (shame on me): I have a time interval in the config for the timer to fire (and the code already checks for a previous execution)… so putting different config on different servers to make them check one after the other… Aside from this rough workaround I wanted to discuss the thing from a correct theoretical standpoint

dave.curylo [2:58 AM]
I think it’s important here to strike a balance between complexity of a distributed system and reliability of the job.

giuliohome [2:58 AM]
I was mentioning Akka because someone (ref. Scalaz and John Ⓐ De Goes) from FP sees it as an OOP “wrong” solution but again I completely agree with your comments above.
Thank you so much!

dave.curylo [3:00 AM]
If you have a lot of jobs and a lot of workers sort of polling a jobs table, taking locks, it gets to the point that it doesn’t really scale. Akka will scale like crazy, but it might be a complete architectural change that lands you with new problems (like cluster nodes to monitor).
Polling is one of those things that is incredibly reliable, because it survives even network partitions. So at small scale, even if it seems naive, there is nothing wrong with it.

giuliohome [3:02 AM]
I will go with the simplest solution I described above. Glad to have your positive feedback about it!
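
For the record, the staggered-timer workaround could be sketched as below. This is only one way to read "different config on different servers": every server keeps the same interval but gets its own slot inside it. The function name `delayUntilNextSlot` and the `nodeCount` / `index` settings are assumptions of this sketch, and it only works if the servers' clocks are reasonably synchronized.

```fsharp
open System

/// Time to wait before the first tick of server `index` (0-based) out of
/// `nodeCount` servers, so that the servers fire one after the other,
/// evenly spread across one polling interval and aligned to the wall clock.
let delayUntilNextSlot (interval: TimeSpan) (nodeCount: int) (index: int) (now: DateTime) =
    // Offset of this server's slot inside the interval, in ticks.
    let offset = interval.Ticks * int64 index / int64 nodeCount
    // Where `now` falls inside the current interval.
    let phase = now.Ticks % interval.Ticks
    // Distance from `now` to the next occurrence of the slot.
    let wait = (offset - phase + interval.Ticks) % interval.Ticks
    TimeSpan.FromTicks wait

// Usage: three servers, 30-second interval, all asked at 09:00:05.
let interval = TimeSpan.FromSeconds 30.0
let now = DateTime(2020, 1, 1, 9, 0, 5)
let waits =
    [ for i in 0 .. 2 -> (delayUntilNextSlot interval 3 i now).TotalSeconds ]
// waits = [25.0; 5.0; 15.0], i.e. the servers fire at :30, :10 and :20
```

Each server would use the returned delay as the timer's first `Interval`; the handler in the code above already resets `Interval` to the regular period on every tick. The existing check on `RequestStatus` is still what prevents two servers from running the same request.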

6 thoughts on “Functional scheduling on a cluster”

  1. IMHO, the way to implement the schedule for a job should differ depending on whether the target is a cluster with one active and one passive node, or many independent nodes in a web farm that hosts an Application Server.

    • Hi! Here I’m dealing with an active-active cluster and my goals are to accomplish both failover and round-robin at a basic level. Interested in your competent comments, as usual.

      • The Admin process that runs any cluster usually offers an API that lets client processes retrieve the state of the cluster, including a reference to the current active node.

        Most frameworks that offer cluster automation routines are based on these APIs. These frameworks usually include event-oriented interfaces, so that you can subscribe your event handler code to the automatic failover event and to other relevant cluster events.

        I would design the job logic to first subscribe an event handler to the failover event and then get a reference to the current active node of the cluster, so that the job is robust and resilient to failovers and stays consistent even after one.

        Kind regards, GEN

        • Good to know, thanks. Notice however that my post doesn’t refer to the admin process but to an application service on top of it, which abstracts out the details of the specific infrastructure API layer. Of course I’m grateful for your comments and I agree with you: that’s part of a much broader topic, sometimes even hyped under the name of porting an app to the cloud. I’ve recently read a good book on the correct cloud approach and in fact it confirms that adapting an app to run over multiple application servers is a core part, and that there are a couple of main design patterns from the architecture perspective (a db lock based solution like the one described here in my post is the typical one). Well, again, in the context of load balancing for multiple application servers. So, finally, thanks for the tips from the admin side of cluster management! 😉

  2. In the case of Windows HPC infrastructure solutions (these are bundles of HW/SW from MS and some HW partners), there is a similar API that offers failover clustering admin functions, but it is more complex, as it includes an OO API to manage a complete HPC Compute solution, including complete clustering management.
