8 Tips for Building An Effective AI System Infrastructure


By Andy Thurai, Emerging Technology Strategist, Oracle Cloud Infrastructure

If you are a strategist, or a CIO, or a decision maker for your enterprise AI systems, the following octet of AI antidotes might help cure your AI nervousness.

With the global datasphere projected to grow from 16.1 ZB in 2016 to 163 ZB by 2025, a tenfold increase in under a decade, almost every enterprise will struggle to cope using its existing business information and business analytics systems. If your future AI systems are not designed properly, these massive amounts of data can paralyze your infrastructure, systems, storage, network, security and platform. Unfortunately, handling that data requires a new set of tools and infrastructure that neither the current datacenter architecture and capacity nor first-generation cloud capabilities can support.

AI is the Fastest Growing Workload for Enterprises

These next-generation enterprise analytics/AI workloads will include autonomous cars, engineering simulations, DNA sequencing, weather simulations, and risk modeling. To run AI workloads without limitations, you need to remove the infrastructure and throughput bottlenecks first.

If not done right, this could have a chilling effect on any company’s ability to innovate or compete. It can also throttle your organization’s growth and innovation by constantly requiring you to spend time and energy on planning your IT resources.

So, while you are planning for a true AI system, consider the following:

  • Efficiency: Right-size the infrastructure for the AI workload, every time.

The size of AI workloads can vary from time to time and from model to model, making it hard to plan for the right-sized infrastructure. Enterprise IT typically solves the AI capacity-planning problem by building systems that can cater to the largest expected AI workload. But these workloads can now be run in a matter of minutes, thanks to powerful GPUs, which means the infrastructure sits idle much of the time, costing money without doing any work. An AI infrastructure should instead be sized on demand for each specific AI workload, using a flexible scheduler and other infrastructure features that make it easily scalable.
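
To make the idle-capacity argument concrete, here is a minimal sketch that compares the node-hours paid for under peak provisioning with the node-hours the jobs actually use. The `Job` class, the function names, and the three sample jobs are all hypothetical and chosen only for illustration; they do not come from any real scheduler.

```python
from dataclasses import dataclass


@dataclass
class Job:
    nodes: int    # nodes the job needs while it runs (hypothetical figure)
    hours: float  # wall-clock run time in hours (hypothetical figure)


def on_demand_node_hours(jobs):
    """Node-hours consumed when each job gets exactly the nodes it needs."""
    return sum(job.nodes * job.hours for job in jobs)


def peak_provisioned_node_hours(jobs, period_hours):
    """Node-hours paid for when capacity is fixed at the largest job's size."""
    peak_nodes = max(job.nodes for job in jobs)
    return peak_nodes * period_hours


def utilization(jobs, period_hours):
    """Fraction of the peak-provisioned capacity that does useful work."""
    return on_demand_node_hours(jobs) / peak_provisioned_node_hours(jobs, period_hours)


# Three short GPU jobs spread over one 24-hour day (assumed numbers).
jobs = [Job(nodes=8, hours=0.5), Job(nodes=2, hours=3.0), Job(nodes=16, hours=0.25)]
print(on_demand_node_hours(jobs))             # 14.0 node-hours actually used
print(peak_provisioned_node_hours(jobs, 24))  # 384 node-hours paid for
print(f"{utilization(jobs, 24):.1%}")         # share of capacity doing work
```

With these assumed numbers, a cluster sized for the largest job does useful work for under 4% of the day, which is the gap that on-demand sizing is meant to close.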

  • Non-Virtual: Run it bare.

AI workloads generally do not run well in virtualized environments. Combine that with security, co-tenancy and privacy issues, and it is easy to see why enterprises are very hesitant to move their HPC/AI workloads to the cloud. Generally, between 10% and 15% of processing power can be lost to the overhead of maintaining the virtual environment. This loss can be eliminated by running workloads on high-performance bare metal servers that match on-prem performance.

  • Automation: Elastic scalability. Don’t forget the clean-up.

An AI infrastructure cannot be scaled up and down manually. Infrastructure components (compute, storage, networking, and even security) need to be entirely automated. More importantly, scaling down and wiping everything clean when the work is done is critical. This often-forgotten clean-up step is imperative because AI workloads are frequently very sensitive. Data and instances left behind are exposed to attack, posing a security risk, and unused capacity sitting around costs money as well.
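
One way to make the clean-up step impossible to forget is to tie it to the same code path that provisions the cluster. The sketch below uses a Python context manager so that teardown runs even when the job fails. `InMemoryCloud`, `launch`, `wipe_and_terminate`, and `ephemeral_cluster` are hypothetical stand-ins written for this example, not the API of any real cloud provider.

```python
from contextlib import contextmanager


class InMemoryCloud:
    """Stand-in for a real provisioning API (hypothetical, illustration only)."""

    def __init__(self):
        self.active = {}   # cluster_id -> cluster record
        self._next_id = 0

    def launch(self, shape, count):
        self._next_id += 1
        cluster_id = f"cluster-{self._next_id}"
        self.active[cluster_id] = {"shape": shape, "count": count, "scratch": []}
        return cluster_id

    def wipe_and_terminate(self, cluster_id):
        cluster = self.active.pop(cluster_id)  # stop paying for the instances
        cluster["scratch"].clear()             # discard leftover working data


@contextmanager
def ephemeral_cluster(cloud, shape, count):
    """Provision a cluster for one job and always tear it down afterwards."""
    cluster_id = cloud.launch(shape, count)
    try:
        yield cluster_id
    finally:
        # Runs on success and on failure, so nothing is left behind.
        cloud.wipe_and_terminate(cluster_id)


cloud = InMemoryCloud()
try:
    with ephemeral_cluster(cloud, shape="gpu-node", count=4) as cluster_id:
        cloud.active[cluster_id]["scratch"].append("sensitive intermediate data")
        raise RuntimeError("training job failed")
except RuntimeError:
    pass
print(cloud.active)  # {} because the cluster was wiped despite the failure
```

In a real deployment the two methods would wrap whatever provisioning tool is in use; the point of the pattern is only that scale-down and wipe are not a separate, skippable step.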

  • Mix it up: Have a combination of GPUs and CPUs as needed.

Not all AI workloads require GPUs. Many can be handled just as well by CPU clusters, which are less expensive. Even GPU workloads usually need fast, high-frequency CPUs paired with them. When you are planning and building AI systems, build in enough flexibility that the mix of GPUs and CPUs can be deployed as needed.
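
As a rough illustration of deploying the mix "as needed", the sketch below picks the cheapest pool that can still finish a job before its deadline. The function name `cheapest_pool`, the run-time estimates, and the hourly prices are hypothetical; a real scheduler would take these from benchmarks and actual price lists.

```python
def cheapest_pool(est_hours, hourly_price, deadline_hours):
    """Return the cheapest pool that meets the deadline, or None if none can.

    est_hours:    pool name -> estimated run time in hours on that pool
    hourly_price: pool name -> price per hour for that pool
    """
    costs = {
        pool: est_hours[pool] * hourly_price[pool]
        for pool in est_hours
        if est_hours[pool] <= deadline_hours
    }
    if not costs:
        return None
    return min(costs, key=costs.get)


# Assumed figures: the GPU pool is 12x faster but 20x more expensive per hour.
est_hours = {"cpu": 6.0, "gpu": 0.5}
hourly_price = {"cpu": 1.0, "gpu": 20.0}

print(cheapest_pool(est_hours, hourly_price, deadline_hours=8))  # cpu
print(cheapest_pool(est_hours, hourly_price, deadline_hours=1))  # gpu
```

With a relaxed deadline the CPU cluster wins on cost; tighten the deadline and only the GPU pool qualifies, which is why the infrastructure needs to offer both.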

  • Networking: Have them interconnect faster.

One of the most painfully overlooked issues is interconnectivity. Depending on the workload, you might have to spin up multiple instances, or clusters of high-performance instances. Tightly coupled workloads, such as engineering simulations or AI inferencing, involve heavy data transfer between instances. If you connect your instances with a low-bandwidth network, you defeat the whole purpose. It is like pairing a high-powered Rolls-Royce engine with a Prius transmission! Make sure your instances are connected through RDMA/RoCE networks, which can provide up to 100 Gbps of bandwidth with inter-instance latency as low as 1 microsecond.
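
To see how much the interconnect matters, a back-of-the-envelope model is enough: transfer time is roughly latency plus payload size divided by bandwidth. The sketch below applies that simplified model, which ignores protocol overhead and congestion. The 100 Gbps and 1 microsecond figures come from the paragraph above; the 10 Gbps / 50 microsecond comparison network and the 1 GB payload are assumptions made for illustration.

```python
def transfer_seconds(payload_bytes, bandwidth_gbps, latency_us):
    """Approximate one-way transfer time: latency + size / bandwidth."""
    bits = payload_bytes * 8
    return latency_us * 1e-6 + bits / (bandwidth_gbps * 1e9)


payload = 1e9  # 1 GB exchanged between two instances (assumed)

fast = transfer_seconds(payload, bandwidth_gbps=100, latency_us=1)   # RDMA/RoCE
slow = transfer_seconds(payload, bandwidth_gbps=10, latency_us=50)   # assumed

print(f"100 Gbps link: {fast:.6f} s")
print(f" 10 Gbps link: {slow:.6f} s")
print(f"slowdown: {slow / fast:.1f}x")
```

For large payloads the bandwidth term dominates, while for the many small messages typical of tightly coupled jobs the latency term dominates, so both numbers are worth checking.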

  • Edge: AI should be able to support Edge extension.

The edge network is expanding rapidly along with 5G connectivity. Your core network should be able to easily update models and move them to the edge, and the edge should infer locally as much as possible. Inferencing done at the edge eliminates costly round trips to the core network.

  • Fresh Hardware: Needs to be refreshed every 3 to 5 years.

These high-powered AI systems generally become outdated and need to be replaced every 3 to 5 years. A major advantage of the cloud environment is that the vendor offers the latest hardware, networking, processing and storage much sooner than enterprise data centers can. With proper cost modeling, including Capex vs. Opex, the cloud may be the cheaper option in most cases.
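
A very simple version of the Capex vs. Opex comparison spreads the hardware purchase over its refresh cycle and asks how many hours per year the system must be busy before owning it beats renting. The function names and every dollar figure below are hypothetical placeholders; a real model would also account for staffing, power, discounts, and data transfer charges.

```python
def on_prem_annual_cost(capex, annual_opex, refresh_years):
    """Yearly cost of owned hardware: purchase spread over its life plus running costs."""
    return capex / refresh_years + annual_opex


def break_even_hours_per_year(annual_cost, cloud_hourly_rate):
    """Usage above this many hours per year makes on-prem cheaper than cloud."""
    return annual_cost / cloud_hourly_rate


# Assumed figures for one GPU system and a comparable cloud instance.
annual = on_prem_annual_cost(capex=400_000, annual_opex=50_000, refresh_years=4)
hours = break_even_hours_per_year(annual, cloud_hourly_rate=30.0)

print(annual)  # 150000.0 per year
print(hours)   # 5000.0 hours per year, out of 8760 in a year
```

Under these assumptions the owned system only pays off if it is busy for more than about 57% of the year, and a shorter refresh cycle pushes that threshold higher.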

  • Instant: Don’t wait to run the AI workloads.

Many customers queue their AI workloads because other, higher-priority jobs are running and tying up the infrastructure. This is especially true for on-prem infrastructure, where resources are limited. Cloud computing solves this problem easily: capacity can be added when it is needed, so there is no need to queue HPC workloads and wait. In most cases, you should be able to run the workloads instantly and get immediate results.

Exponentially growing data volumes will overwhelm any enterprise IT infrastructure’s ability to run AI efficiently unless that infrastructure is properly planned. Plan to liberate your AI, not cripple it, with the right strategy, infrastructure, storage, security and toolset.

Andy Thurai is a technologist, strategist, and evangelist with more than 25 years of experience in the industry. He currently works as an Emerging Technology Strategist with Oracle Cloud Infrastructure. His areas of knowledge and passion include AI, ML, DL, Edge, IoT, Cloud and Security.

You can reach him on LinkedIn at https://www.linkedin.com/in/andythurai/, and on Twitter at https://twitter.com/AndyThurai.