robot unstable

True stories from the frontlines of automation

In our field, we often hear valid concerns about the technical stability of robots. There is a widespread underlying doubt about whether Robotic Process Automation (RPA) is a valid, stable, and unified approach to automation in any organization. It is true that robotic productivity can be lost to downtime or failures. For clients with existing automation projects, whether RPA or any other automation technology, these concerns can top the list when deciding whether to automate. For new clients with no knowledge of the potential pitfalls of automation, it is equally important to highlight these risks so that they are fully informed.

Additionally, a common fear we’ve noticed is that once RPA is embedded in the organisation, any day on which the robot doesn’t work could lead to immediate panic and confusion among employees. After a prolonged period of the robot running optimally, employees and their managers grow used to the decreased workload and become busy with other tasks. That is a good thing; after all, it is the purpose of automation.

Suddenly the doubt creeps in: can we really trust the robot to perform this job? If the robot were to stop working now, we’d be in a lot of trouble…

We wanted to write this article, given our experience across numerous business sectors, to help clarify some of these concerns and really touch on what the core problems are when talking about RPA.


It’s not always the poor robot’s fault

If you have a robot that’s acting up and you’re seeing errors thrown left and right, the problem, most of the time, is not with the robot itself. Assuming you’ve had an expert team like us at BrightKnight do the development (wink wink), followed by a reasonable period of hypercare/aftercare once live in production, the cause is more commonly found in the applications and infrastructure that the robot interacts with.

Drawing on BrightKnight’s extensive project experience, we’ve compiled a short list of the most common incidents we have faced in our projects. We will touch on these issues and propose tried and tested risk mitigation techniques to help you on your automation journey.


Common issues once live

Target application stability and roadmaps

This is the big one. Errors suddenly appear that are not related to input or development of the robot, but are instead more general, such as input delay/system response delay or even target applications simply not starting, working, or just being down for any number of reasons.

The roadmaps of these applications (upgrades, version releases, etc.) are often fully loaded and take too little account of those who interact with these tools on a day-to-day basis, robots included. End users of these applications can be confronted with the unfortunate discovery of cosmetic or even more fundamental changes. Even small changes, such as a new button or a changed screen title, can disrupt end-user routines and, equally, the pre-defined tasks of the robots.

End user experience level

Once operation of the robot has been handed over to the business, it can become apparent that the team lacks experience working with robots. Sometimes this lack of experience leads to misreading errors/exceptions and jumping to the wrong conclusions. It’s important to have a fully trained team and a reliable support line to the development team after going live.

Changes to applications

In our experience, it is very likely that changes to applications are not communicated to the robotics team, leading to a whole host of production issues.

Server/virtual machine issues

Much of the development of our robots is done on virtual machines. These machines need to be configured correctly, with the level of resources and permissions required to perform the operations dictated by the design of the bot.
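
To make the resource and permission requirement concrete, here is a minimal sketch of a pre-flight check a robot could run on its machine before starting work. It is an illustration only: the `Requirements` fields and the thresholds are hypothetical and would in practice come from the bot's design document. It uses only the Python standard library.

```python
import os
import shutil
from dataclasses import dataclass


@dataclass
class Requirements:
    # Hypothetical thresholds; real values come from the bot's design document.
    min_cpu_cores: int
    min_free_disk_gb: float
    writable_dirs: tuple


def preflight_check(req: Requirements) -> list:
    """Return a list of human-readable problems; an empty list means the machine looks ready."""
    problems = []

    cores = os.cpu_count() or 0  # cpu_count() can return None
    if cores < req.min_cpu_cores:
        problems.append(f"CPU cores: need {req.min_cpu_cores}, found {cores}")

    for directory in req.writable_dirs:
        if not os.path.isdir(directory):
            problems.append(f"Missing directory: {directory}")
            continue
        if not os.access(directory, os.W_OK):
            problems.append(f"No write permission: {directory}")
        free_gb = shutil.disk_usage(directory).free / 1024**3
        if free_gb < req.min_free_disk_gb:
            problems.append(
                f"Free disk on {directory}: need {req.min_free_disk_gb} GB, found {free_gb:.1f} GB"
            )

    return problems
```

Running such a check at the start of every run turns a vague "the robot is broken" into a specific message that can be forwarded to whoever manages the machine.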


Ways of mitigating the risks

Clear & Direct Communication with IT

It’s the secret to a happy marriage and, funnily enough, also to resolving your robot’s issues. Knowing the Points of Contact (POC) at IT for specific issues, from infrastructure to scheduling, is crucial to keeping robots running smoothly. A quick direct message to the right POC can tell you exactly what you need to know.

Awareness of system/application upgrades, downtime, maintenance, …

In line with the point above, communication is again key. But where the previous point touches more on day-to-day operations, this one is more structural.

It is vital that the RPA project team is kept in the loop and made a part of discussions related to system and application upgrades, downtime, and maintenance.

If your robot stops working because a user interface changed over the weekend, and you had no clue that was going to happen, this is where you should start your incident investigation.

Most organisations have had these notification systems and procedures in place for a while, and often it is simply a question of incorporating RPA into the existing governance.
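
When a user interface does change unannounced, one way to start the investigation is to compare what the robot expects to see with what is actually on screen. The sketch below assumes you can capture the visible elements of a screen as a simple mapping of element name to label; how that capture is done depends entirely on your RPA tool and is not shown. The function name and the element names are hypothetical.

```python
def diff_ui_snapshot(expected: dict, observed: dict) -> list:
    """Compare two {element_name: label} mappings and describe the differences."""
    changes = []
    for element, label in expected.items():
        if element not in observed:
            changes.append(f"missing: {element}")
        elif observed[element] != label:
            changes.append(f"changed: {element} ({label!r} -> {observed[element]!r})")
    # Elements that appeared since the baseline was recorded
    for element in sorted(observed.keys() - expected.keys()):
        changes.append(f"new: {element}")
    return changes


# Baseline recorded when the robot went live vs. the screen after the weekend
baseline = {"window_title": "Invoices", "submit_button": "Submit"}
monday = {"window_title": "Invoice Overview", "submit_button": "Submit", "export_button": "Export"}
for change in diff_ui_snapshot(baseline, monday):
    print(change)
```

A report like this gives the conversation with the application owner a concrete starting point.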

Resilience as a design choice

The first line of defense should be your robots themselves.

Rather quickly you will find out that this is not the case. Luckily, the solution is simple: build resilient robots that treat failure handling as a core design concept and, above all, test, test and, just when you think you’ve tested enough, test one last time, just for fun.
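
As one illustration of what handling failure as a design concept can look like, here is a minimal sketch of a retry wrapper with exponential backoff. It is not tied to any RPA product: `TransientError` and `run_with_retry` are hypothetical names, and deciding which failures count as transient (a slow screen, a timeout) is up to your own design.

```python
import time


class TransientError(Exception):
    """Hypothetical error type for failures worth retrying (timeouts, slow screens)."""


def run_with_retry(step, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Run step(); on TransientError wait and retry, doubling the delay each time."""
    for attempt in range(1, attempts + 1):
        try:
            return step()
        except TransientError:
            if attempt == attempts:
                raise  # out of retries: surface the error to the controller
            sleep(base_delay * 2 ** (attempt - 1))
```

Any other exception is deliberately not caught here: a genuine business or logic error should stop the run and be reported, not be retried blindly.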

Proactive Controlling

Can your controller see into the future? Probably not, but if they can, seriously, give us their number. If, however, they see that a certain application has been failing multiple days in a row, leading to downtime of the robots, it’s worth investigating in more detail and raising the issue with developers and IT, instead of just restarting the run, washing your hands, and calling it a day.

Controlling is not only about reacting to incidents as they happen but equally about anticipating performance issues based on these incidents.
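
Spotting an application that fails several days in a row does not have to rely on memory. The sketch below assumes the controller can export incidents as (date, application) pairs, for example from a run log; the function name, the threshold of three days, and the sample data are all hypothetical. It counts calendar days, so a streak interrupted by a weekend is not detected.

```python
from collections import defaultdict
from datetime import date, timedelta


def apps_failing_consecutively(incidents, min_days=3):
    """Return {application: longest streak} for apps failing on min_days or more consecutive days."""
    days_by_app = defaultdict(set)
    for day, app in incidents:
        days_by_app[app].add(day)

    flagged = {}
    for app, days in days_by_app.items():
        longest = 0
        for day in days:
            if day - timedelta(days=1) in days:
                continue  # not the first day of a streak
            length = 1
            while day + timedelta(days=length) in days:
                length += 1
            longest = max(longest, length)
        if longest >= min_days:
            flagged[app] = longest
    return flagged


# Illustrative data only
log = [
    (date(2025, 3, 3), "ERP"), (date(2025, 3, 4), "ERP"), (date(2025, 3, 5), "ERP"),
    (date(2025, 3, 3), "Mail"), (date(2025, 3, 5), "Mail"),
]
print(apps_failing_consecutively(log))
```

Anything this flags is a candidate for a conversation with developers and IT, as opposed to another restart.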



In summary, any sort of automation is complex. It requires extensive testing and an expert ability to deduce the issues and risks that are not necessarily in plain sight. You’ll usually experience some form of error; it’s all but inevitable. But with an adequate hypercare/aftercare period and a proper testing and design methodology, you minimize the risk of errors stemming from the development of the robot itself, allowing your business as usual (BAU) to flow smoothly. Most of the time, the main errors you’ll experience come from changes to the underlying environment the robot interacts with, not from the robot’s development. In the end, the robot is just doing what it’s been told to do and can’t be faulted for working on shaky ground. That said, even if it is working in an unstable environment, the mitigation measures listed above will help reduce the risk of any major issues.


Interested in getting more out of your robots and reaping the immediate benefits of a digitally automated environment? Get in touch with us by sending a mail to .


Article written by Daniel Fastenau & Nico Esposito