Distributed file system

Systems don’t last forever. Whether through disk corruption, component burn-out or external factors like fire or flood, servers eventually reach the end of their working lives – some sooner than others. When they do, what happens to the data they contain?

Organisations with effective backup procedures will recover, given time, but they’ll still experience downtime and potential loss of revenue if they rely on a single server for their big data processing.

Distributed computing recognises the inevitable fallibility of computer hardware and mitigates it by spreading data and processing load across several machines, potentially located at multiple sites. Not only does this improve resilience and reduce the likelihood of downtime; it can also accelerate operations, as processes can be run in parallel.

Hadoop: Built to detect failure and recover 

Apache’s open source Hadoop framework was designed for precisely this purpose, allowing networks of discrete computers to work together on data-intensive processing tasks while also managing distributed storage.

As the project explains, “rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly available service on top of a cluster of computers, each of which may be prone to failures.” 

Benefits of Hadoop 

Hadoop is tried and tested, well supported, and widely deployed. It’s also free to download and implement, and, says Mark Smallcombe at Integrate, “delivers high availability and fast processing speeds for large data stores that can include both structured and unstructured data. It also delivers high throughput and, as it is an open-source project, developers can adjust the Java code to suit their requirements.” 

One of the primary benefits of Hadoop is that it can run on inexpensive ‘commodity’ hardware while delivering supercomputer-like results. However, warns IBM, “though it’s open source and easy to set up, keeping the server running can be costly. When using features like in-memory computing and network storage, big data management can cost up to $5,000 USD”. 

How does Hadoop work? 

Hadoop works by breaking large data sets into blocks and distributing them across multiple nodes in a cluster. By default, each block is stored as three copies, and the code that processes it is shipped to a machine that holds the data rather than the other way round. Each task therefore runs locally, reducing execution time, and in parallel with tasks on other nodes, further reducing the overall job completion time.
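As a rough illustration, the replication factor can be set when writing a file to HDFS through Hadoop’s Java client API. This is a minimal sketch rather than production code; the file path is hypothetical and the three-copy setting simply restates the usual default.

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FSDataOutputStream;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class HdfsReplicationSketch {
      public static void main(String[] args) throws Exception {
          Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml from the classpath
          conf.set("dfs.replication", "3");           // keep three copies of every block (the usual default)

          try (FileSystem fs = FileSystem.get(conf)) {
              Path file = new Path("/data/events/sample.txt");   // hypothetical path
              try (FSDataOutputStream out = fs.create(file)) {
                  out.writeBytes("example record\n");
              }
              fs.setReplication(file, (short) 3);     // replication can also be adjusted per file after writing
          }
      }
  }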

As Michael Malak writes at the Data Science Association, “Hadoop brings the compute to the data”. 
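That idea is easiest to see in Hadoop’s MapReduce API. The sketch below is the canonical word-count job: map tasks emit (word, 1) pairs from whichever blocks of the input live on their node, and reduce tasks sum the counts. The input and output paths are supplied on the command line and are assumptions for illustration.

  import java.io.IOException;
  import java.util.StringTokenizer;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Job;
  import org.apache.hadoop.mapreduce.Mapper;
  import org.apache.hadoop.mapreduce.Reducer;
  import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
  import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

  public class WordCount {

      // Runs on the node holding each input block; emits (word, 1) for every token it sees.
      public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
          private final static IntWritable one = new IntWritable(1);
          private final Text word = new Text();

          public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
              StringTokenizer itr = new StringTokenizer(value.toString());
              while (itr.hasMoreTokens()) {
                  word.set(itr.nextToken());
                  context.write(word, one);
              }
          }
      }

      // Receives every count emitted for a given word and sums them.
      public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
          private final IntWritable result = new IntWritable();

          public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
              int sum = 0;
              for (IntWritable val : values) {
                  sum += val.get();
              }
              result.set(sum);
              context.write(key, result);
          }
      }

      public static void main(String[] args) throws Exception {
          Job job = Job.getInstance(new Configuration(), "word count");
          job.setJarByClass(WordCount.class);
          job.setMapperClass(TokenizerMapper.class);
          job.setCombinerClass(IntSumReducer.class);   // pre-aggregates on each node before the shuffle
          job.setReducerClass(IntSumReducer.class);
          job.setOutputKeyClass(Text.class);
          job.setOutputValueClass(IntWritable.class);
          FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory on HDFS
          FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory on HDFS
          System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
  }

Packaged as a jar and launched with hadoop jar, a job like this leaves the scheduler to place each map task on, or as near as possible to, the node holding its block of data.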

Hadoop can be implemented on premises or in the cloud, and the location of every process is tracked. This way, should one instance fail, its workload can be moved to another node in the cluster, ideally one holding a replica of the data it was working on, so that processing remains as close as possible to the data on which it relies.

Alternative options for big data processing 

Naturally, Hadoop isn’t the only option for distributing and processing big data. Merit’s Technical Project Manager, Abdul Rahuman, says that “With modern Big Data cloud technologies like Spark and Hadoop, cloud data warehousing has now become as simple as finding the right platform/services e.g. AWS EMR, Glue, Redshift, Databricks, Snowflake etc.   

Organisations can create, manage and scale their bigdata eco systems with simple clicks and do not need to worry about time and effort that used to be a nightmare. There is also the added benefit of decoupling storage and dynamic scaling up or down of application and data processing performance, whilst saving on operating cost and reducing the risk of compromising security.” 

Spark: A faster alternative to Hadoop 

Spark can likewise distribute large data sets across a series of nodes. “However,” notes IBM, “it tends to perform faster than Hadoop and it uses random access memory (RAM) to cache and process data instead of a file system. This enables Spark to handle use cases that Hadoop cannot.” 

It also means that, depending on the parameters of the job, Spark can deliver results significantly faster than Hadoop.
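As a brief, hedged sketch of what this looks like in Spark’s Java API, the job below caches a dataset in memory after its first use, so that subsequent queries are answered from RAM rather than re-read from storage. The input path and the ‘level’ column are hypothetical.

  import org.apache.spark.sql.Dataset;
  import org.apache.spark.sql.Row;
  import org.apache.spark.sql.SparkSession;

  public class SparkCacheSketch {
      public static void main(String[] args) {
          SparkSession spark = SparkSession.builder().appName("CacheSketch").getOrCreate();

          // Hypothetical input: a directory of JSON event records on HDFS or cloud storage.
          Dataset<Row> events = spark.read().json("hdfs:///data/events");
          events.cache();   // keep the dataset in memory once it has been computed

          long total = events.count();                              // first action: reads from storage and fills the cache
          long errors = events.filter("level = 'ERROR'").count();   // second action: served from memory

          System.out.println(total + " events, " + errors + " errors");
          spark.stop();
      }
  }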

Google BigQuery: PaaS for real-time analysis of data 

Google BigQuery is designed to handle petabytes of data in a Platform as a Service (PaaS) model. With built-in machine learning capability, one of its strengths is predictive modelling using standard SQL, and it’s also optimised for analysis of real-time data which, as DS Stream outlines, “allows companies to make decisions and take actions almost immediately after receiving information.

Real-time systems are used in air traffic control, banking (in ATMs, when customers need to see a current view of their finances seconds after performing a transaction) and many other industries”. 
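To make the ‘predictive modelling using SQL’ point concrete, BigQuery ML allows a model to be trained and queried with ordinary SQL statements, which can be submitted through Google’s Java client library. The sketch below is illustrative only; the dataset, table and column names are hypothetical.

  import com.google.cloud.bigquery.BigQuery;
  import com.google.cloud.bigquery.BigQueryOptions;
  import com.google.cloud.bigquery.QueryJobConfiguration;
  import com.google.cloud.bigquery.TableResult;

  public class BigQueryMlSketch {
      public static void main(String[] args) throws Exception {
          BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

          // Train a simple regression model with standard SQL (names are hypothetical).
          String createModel =
                  "CREATE OR REPLACE MODEL `analytics.sales_forecast` "
                + "OPTIONS(model_type='linear_reg', input_label_cols=['revenue']) AS "
                + "SELECT region, units_sold, revenue FROM analytics.sales_history";
          bigquery.query(QueryJobConfiguration.newBuilder(createModel).build());

          // Score new rows with ML.PREDICT, again in plain SQL.
          String predict =
                  "SELECT * FROM ML.PREDICT(MODEL `analytics.sales_forecast`, "
                + "(SELECT region, units_sold FROM analytics.sales_today))";
          TableResult rows = bigquery.query(QueryJobConfiguration.newBuilder(predict).build());
          rows.iterateAll().forEach(System.out::println);
      }
  }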

Amazon Redshift for larger quantities of data 

Similarly, Amazon Redshift can perform analytics on large quantities of data, to deliver both forecasts and real-time results on which organisations can base immediate actions. Like Hadoop, it uses a node and cluster model, with a Redshift engine assigned to each cluster. 

Both Redshift and BigQuery are cloud hosted, by Amazon and Google respectively, giving enterprises the ability to flex resources as their needs change. “Redshift can be scaled up or down by quickly activating individual nodes of varying sizes,” notes Sisense. “This scalability also means cost savings, as companies aren’t forced to spend money maintaining servers that are unused or to quickly purchase expensive server space when the need arises. This is especially useful for smaller companies which experience significant growth and must scale their existing solutions.” 
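As a small, hedged example of that flexing, cluster resizing is exposed through Redshift’s API as well as the console; the sketch below uses the AWS SDK for Java v2 to request an elastic resize of an existing cluster. The cluster identifier and the target node count are hypothetical.

  import software.amazon.awssdk.services.redshift.RedshiftClient;
  import software.amazon.awssdk.services.redshift.model.ResizeClusterRequest;

  public class RedshiftResizeSketch {
      public static void main(String[] args) {
          // Credentials and region are taken from the default AWS configuration chain.
          try (RedshiftClient redshift = RedshiftClient.create()) {
              redshift.resizeCluster(ResizeClusterRequest.builder()
                      .clusterIdentifier("analytics-cluster")   // hypothetical cluster name
                      .numberOfNodes(4)                          // scale out to four nodes
                      .build());
          }
      }
  }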

Kubernetes vs Hadoop’s YARN 

Kubernetes, as a replacement for Hadoop’s YARN resource management component, is posing a challenge to Hadoop in the big data application space. In part, says Twain Taylor at TechGenix, this is because “Hadoop was built during an era when network latency was one of the major issues faced by organisations. However, with the evolution of the concept of microservices and container technology, organisations are quickly realising that hosting the entire data on cloud storage provides several additional advantages.  

Several components of the Big Data stack (such as Spark and Kafka) can be hosted and operated on the cloud-based Kubernetes environment more efficiently. Applications hosted on containers can be easily started on-demand and shut down as per requirements.” 
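As an illustration of that shift, a Spark job can target a Kubernetes cluster instead of YARN simply by changing the master URL and supplying a container image. The sketch below uses Spark’s SparkLauncher class; the API server address, image, jar path and main class are placeholders rather than a working configuration.

  import org.apache.spark.launcher.SparkLauncher;

  public class SparkOnKubernetesSketch {
      public static void main(String[] args) throws Exception {
          Process job = new SparkLauncher()
                  .setMaster("k8s://https://kubernetes.example.com:6443")              // placeholder API server
                  .setDeployMode("cluster")                                            // driver runs inside the cluster
                  .setConf("spark.kubernetes.container.image", "example/spark:3.5.0")  // placeholder image
                  .setAppResource("local:///opt/spark/jars/analytics-job.jar")         // placeholder jar inside the image
                  .setMainClass("com.example.AnalyticsJob")                            // placeholder entry point
                  .launch();
          job.waitFor();
      }
  }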

The challenges of using Hadoop

Although Hadoop is long-established and widely implemented, alternative solutions are becoming more appealing due to their increased flexibility and ease of use. Hadoop is developed in Java, which BMC explains “is not the best language for data analytics, and it can be complex for new users. This can lead to complications in configurations and usage—the user must have thorough knowledge in both Java and Hadoop to properly use and debug the cluster.” 

While Hadoop faces increased competition, its future remains bright, and enterprises can be confident that building big data applications on the platform is as viable a long-term proposition as any other big data processing and storage model.

Indeed, Market Trends predicted, in late 2021, that “the global Hadoop Market is expected to be valued at US$ 404.4 Bn by 2028, exhibiting a CAGR of 37.9% during the forecast period”.
