
Parallel Computing

32,909 papers
49,292 followers
About this topic
Parallel computing is a computational paradigm that divides a problem into smaller sub-problems, which are solved simultaneously by multiple processors or cores. This approach enhances computational speed and efficiency, enabling the handling of large-scale problems that would be time-prohibitive on a single processor.
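The paradigm described above can be made concrete with a minimal sketch: split a computation into sub-problems, solve them concurrently, and combine the partial results. The function names here are illustrative, not from any of the listed papers; threads are used for simplicity (for CPU-bound work in Python, a process pool would sidestep the GIL).

```python
from concurrent.futures import ThreadPoolExecutor

def partial_sum(chunk):
    """Solve one sub-problem: sum a slice of the data."""
    return sum(chunk)

def parallel_sum(data, workers=4):
    """Divide the input into roughly equal chunks, sum them
    concurrently, then combine the partial results."""
    size = max(1, len(data) // workers)
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(partial_sum, chunks))

print(parallel_sum(list(range(1000))))  # same result as sum(range(1000))
```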

Key research themes

1. How can programming models and architectural designs optimize performance and scalability in parallel computing?

This research theme investigates the comparative efficiency, adaptability, and execution models of programming paradigms and hardware architectures designed to leverage parallelism. Understanding these elements is crucial for improving performance on modern heterogeneous systems, overcoming bottlenecks in scalability and resource management, and guiding the development of future computing platforms and programming frameworks.

Key finding: The paper presents an in-depth performance analysis of a compute-bound application across six portable parallel programming models—OpenMP, OpenCL, CUDA, OpenACC, Kokkos, and SYCL—highlighting relevant trade-offs in... Read more
Key finding: This work proposes a massively parallel processor (MPP) prototype with a distributed shared memory architecture that combines coarse-grained RISC cores and fine-grained operations through a novel coherence protocol and... Read more
Key finding: The authors provide a formal, generalized functional specification of the divide-and-conquer pattern and prove its equivalence to several classical parallel programming patterns (skeletons). They analyze possible... Read more
Key finding: This paper examines cluster computing architectures built from commodity hardware and standard networking technologies. It highlights key interconnect metrics (bandwidth and latency) and communication software such as MPI and... Read more
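One of the findings above concerns the divide-and-conquer pattern and its equivalence to classical parallel skeletons. The following is a generic sketch of that pattern, not the cited paper's formal specification; the point is that each recursive call is independent and could be dispatched to a separate worker, which is what makes the pattern a parallel skeleton.

```python
import heapq

def divide_and_conquer(problem, is_trivial, solve, divide, combine):
    """Generic divide-and-conquer skeleton: recursively split a problem,
    solve base cases directly, and merge the sub-solutions."""
    if is_trivial(problem):
        return solve(problem)
    subsolutions = [divide_and_conquer(p, is_trivial, solve, divide, combine)
                    for p in divide(problem)]
    return combine(subsolutions)

# Instantiating the skeleton as a merge sort:
def merge_sort(xs):
    return divide_and_conquer(
        xs,
        is_trivial=lambda p: len(p) <= 1,
        solve=lambda p: p,
        divide=lambda p: [p[:len(p) // 2], p[len(p) // 2:]],
        combine=lambda subs: list(heapq.merge(*subs)),
    )
```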

2. What are effective optimization strategies and algorithmic innovations for large-scale parallel problem solving in real-world computational domains?

This theme focuses on the development and empirical validation of optimization algorithms and scheduling techniques to enhance the efficiency of parallel computing in practical applications such as cloud resource allocation, multi-core processor memory management, and scheduling algorithms. Addressing issues like load balancing, execution speed, resource utilization, and dynamic scheduling, these strategies are essential for meeting stringent performance and scalability demands in diverse computational environments.

Key finding: The paper proposes a Parallel Ant Colony Optimization (PACO) algorithm that leverages data parallelism to optimize container allocation in cloud environments, improving load balancing, resource utilization, and execution... Read more
Key finding: This study introduces a runtime dynamic detection-based random sampling algorithm for Scratchpad Memory (SPM) allocation in multi-core processor environments, overcoming the limitations of static compiler-based predictions.... Read more
Key finding: The authors design flexible algorithms for balanced problem subdivision and efficient client-server communication over distributed memory systems, validated through matrix algebra case studies. Their approach addresses load... Read more
Key finding: Focused on cloud AI systems, this research develops parallel algorithm optimizations—leveraging distributed frameworks and load balancing—that achieve a 40% reduction in processing times while improving resource utilization... Read more
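The load-balancing objective that runs through these findings can be illustrated with a classic greedy baseline, Longest-Processing-Time-first (LPT). This is not the PACO algorithm or any other method from the cited papers, just a minimal sketch of the problem those methods improve on.

```python
import heapq

def lpt_schedule(task_times, n_workers):
    """Longest-Processing-Time-first: consider tasks in decreasing cost
    order and always assign the next task to the least-loaded worker."""
    loads = [(0.0, w) for w in range(n_workers)]   # (current load, worker id)
    heapq.heapify(loads)
    assignment = {w: [] for w in range(n_workers)}
    for task, cost in sorted(enumerate(task_times), key=lambda t: -t[1]):
        load, w = heapq.heappop(loads)             # least-loaded worker
        assignment[w].append(task)
        heapq.heappush(loads, (load + cost, w))
    makespan = max(load for load, _ in loads)      # finish time of slowest worker
    return assignment, makespan
```

LPT is a simple approximation; its makespan is provably within 4/3 of optimal for identical machines, which is why it is a common baseline for more sophisticated schedulers.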

3. How can specialized hardware designs and emerging computational paradigms improve energy efficiency and functional capabilities in parallel computing?

This research direction explores novel hardware architectures such as CGRAs, photonic clocking mechanisms, and GPU-accelerated state space exploration aimed at enhancing energy efficiency, programmability, and scalability. The investigations are critical for embedded, ultra-low-power computing, high-throughput model checking, and clock signal generation, revealing how hardware innovations can complement parallel programming to push performance-per-watt boundaries and unlock new application domains.

Key finding: R-Blocks is presented as a ULP coarse-grained reconfigurable architecture embodying a flexible VLIW-SIMD execution model and software bypassing to maximize energy efficiency (115 MOPS/mW) on complex workloads like FFT.... Read more
Key finding: This paper introduces a novel photonic clocking mechanism leveraging GaN micro-LEDs and light-sensitive FinFET transistors to generate clock signals with frequencies up to 100 GHz, achieving up to 90% energy savings and... Read more
Key finding: GPUEXPLORE 3.0 advances GPU-accelerated explicit state space exploration by integrating a code generator for model-specific GPU kernels, implementing novel GPU data structures like binary tree hash tables with compact hashing... Read more

All papers in Parallel Computing

With the deployment of innovative memories such as non-volatile memory and 3D-stacked memory in distributed systems, how to improve the application performance by utilizing the unique characteristics of these hybrid memories remains an... more
Heterogeneous nodes built around the Intel® Many Integrated Core Architecture, announced in 2010 as a massively parallel coprocessor, have been broadly adopted in petascale supercomputers such as TianHe-II. The performance... more
High inference latency seriously limits the deployment of DNNs in real-time domains such as autonomous driving, robotic control, and many others. To address this emerging challenge, researchers have proposed approximate DNNs with reduced... more
The recent development of deep learning has mostly focused on Euclidean data, such as images, video, and audio. However, most real-world information and relationships are expressed as graphs. Graph convolutional networks... more
Adaptive mesh refinement (AMR) is one of the most widely used methods in High Performance Computing, accounting for a large fraction of all supercomputing cycles. AMR operates by dynamically and adaptively applying computational resources... more
Deep Neural Networks (DNNs) have revolutionized numerous applications, but the demand for ever more performance remains unabated. Scaling DNN computations to larger clusters is generally done by distributing tasks in batch mode using... more
In this paper, we propose an architecture design called Ultra-Workload-Balanced-GCN (UWB-GCN) to accelerate graph convolutional network inference. To tackle the major performance bottleneck of workload imbalance, we propose two... more
Binarized neural networks (BNNs) promise tremendous performance improvement over traditional DNNs through simplified bit-level computation and significantly reduced memory access/storage cost. In addition, they have the advantages of low cost,... more
Single-Instruction-Multiple-Data (SIMD) architectures are widely used to accelerate applications involving Data-Level Parallelism (DLP); the on-chip memory system facilitates the communication between Processing Elements (PEs) and on-chip... more
The size of deep neural networks (DNNs) grows rapidly as the complexity of the machine learning algorithm increases. To satisfy the requirement of computation and memory of DNN training, distributed deep learning based on model... more
As part of the DOE LQCD-ext project, Fermilab designs, deploys, and operates dedicated high performance clusters for lattice QCD (LQCD) computations. We describe the design of these clusters, as well as their performance and the... more
This paper presents two modular multipliers with efficient architectures based on Booth encoding, higher-radix, and Montgomery powering ladder approaches. The Montgomery powering ladder technique enables concurrent execution of main... more
The ProxSkip algorithm for distributed optimization is gaining increasing attention due to its effectiveness in reducing communication. However, existing analyses of ProxSkip are limited to the strongly convex setting and fail to achieve... more
A wide array of image recovery problems can be abstracted into the problem of minimizing a sum of composite convex functions in a Hilbert space. To solve such problems, primal-dual proximal approaches have been developed which provide... more
In this paper we propose a model of distributed multi-core processors with software controlled dynamic voltage scaling. We consider the problem of energy efficient task scheduling with a given deadline on this model. We consider... more
A Distributed Computing System (DCS) comprising networked heterogeneous processors requires efficient task-to-processor allocation to achieve minimum turnaround time and the highest possible throughput. Task allocation in DCS remains an... more
Neural processing units (NPUs) have become indispensable parts of mobile SoCs. Furthermore, integrating multiple NPU cores into a single chip becomes a promising solution for ever-increasing computing power demands in mobile devices. This... more
The sparse matrix-vector (SpMV) multiplication routine is an important building block used in many iterative algorithms for solving scientific and engineering problems. One of the main challenges of SpMV is its memory-boundedness.... more
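The memory-boundedness mentioned in this abstract is easy to see from the structure of the kernel itself. Below is a plain sketch of SpMV over the standard CSR (compressed sparse row) layout, not taken from the cited paper: each stored nonzero is loaded once and used in exactly one multiply-add, so arithmetic intensity is low and memory bandwidth dominates.

```python
def spmv_csr(values, col_idx, row_ptr, x):
    """Sparse matrix-vector product y = A @ x with A in CSR form:
    values  - nonzero entries, row by row
    col_idx - column index of each entry in `values`
    row_ptr - row i's entries live in values[row_ptr[i]:row_ptr[i+1]]"""
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for i in range(n_rows):
        # One multiply-add per value loaded: the kernel is memory-bound.
        for k in range(row_ptr[i], row_ptr[i + 1]):
            y[i] += values[k] * x[col_idx[k]]
    return y
```

Rows are independent, so the outer loop parallelizes trivially; the hard part, as the abstract notes, is feeding the cores fast enough.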
Parallelization of existing code for modern multicore processors is tedious as the person performing these tasks must understand the algorithms, data structures and data dependencies in order to do a good job. Current options available to... more
Simulations provide a flexible and valuable method to study the behaviors of information propagation over complex social networks. High Performance Computing (HPC) is a technology that allows the implementation of efficient algorithms on... more
Directive-based programming approaches such as OpenMP and OpenACC have gained popularity due to their ease of programming. These programming models typically involve adding compiler directives to code sections such as loops in order to... more
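The effect of a worksharing directive such as OpenMP's `parallel for` can be sketched in plain Python: loop iterations are chunked and handed to a pool of workers. This is only an analogy for illustration (the names `parallel_for` and `body` are invented here); real directive-based models do this chunking in the compiler and runtime.

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_for(n, body, workers=4):
    """Apply body(i) for i in range(n), with iterations split into
    contiguous chunks across workers -- roughly what annotating a loop
    with "#pragma omp parallel for" asks the compiler to arrange."""
    chunk = (n + workers - 1) // workers
    ranges = [range(s, min(s + chunk, n)) for s in range(0, n, chunk)]
    def run(r):
        for i in r:
            body(i)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(run, ranges))

out = [0] * 8
parallel_for(8, lambda i: out.__setitem__(i, i * i))  # each index written once
```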
Stencils represent an important class of computations that are used in many scientific disciplines. Increasingly, many of the stencil computations in scientific applications are being offloaded to GPUs to improve running times. Since a... more
Efficient resource allocation is a fundamental requirement in high performance computing (HPC) systems. Many projects dedicated to large-scale distributed computing systems have designed and developed resource allocation... more
The computational design of peptide binders towards a specific protein interface can aid diagnostic and therapeutic efforts. Here, we design peptide binders by combining the known structural space searched with Foldseek, the protein... more
Efficient data sorting is important for searching and optimization algorithms in time-demanding fields such as image and multimedia data processing. To accelerate the data sorting algorithm applied in practical normalized... more
As a new area of machine learning research, deep learning has attracted a lot of attention from the research community and may bring data analysis to a higher cognitive level. Its unsupervised pre-training step allows... more
Through cross-layer approximation of Deep Neural Networks (DNNs), significant improvements in hardware resource utilization for DNN applications can be achieved. This comes at the cost of accuracy degradation, which can be compensated... more
The richness of wavelet transformation has been known in many fields. There exist different classes of wavelet filters that can be used depending on the application. In this paper, we propose a general purpose lifting-based wavelet... more
In this paper we present three different parallelizations of a discrete radiosity method achieved on a cluster of workstations. This radiosity method has lower complexity when compared with most of the radiosity algorithms and is based on... more
This paper aims to synthesize a uniform rectangular array (URA) which spans beamwidth of -30° to 30° in the azimuthal direction with the interference direction as 40° in the azimuth plane. In this paper, a Modified Genetic Algorithm is... more
An Image Mining System (IMS) requires real-time processing, often using special-purpose hardware. The work presented here applies cluster computing to online image processing inside an IMS, where the end user... more
An efficient Fast logarithmic successive cancellation stack (Log-SCS) polar decoding algorithm is proposed along with its software implementation using single instruction multiple data (SIMD) style processing. Quantitatively, we reduce... more
Floating-point summation is one of the most important operations in scientific/numerical computing applications and also a basic subroutine (SUM) in BLAS (Basic Linear Algebra Subprograms) library. However, standard floating-point... more
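The non-associativity problem this abstract alludes to has a classic remedy: compensated (Kahan) summation, which carries the rounding error of each addition forward. This sketch is a standard textbook algorithm, not the cited paper's method.

```python
def kahan_sum(values):
    """Compensated (Kahan) summation: track the low-order bits lost in
    each addition and feed them back into the next term."""
    s = 0.0
    c = 0.0
    for x in values:
        y = x - c          # apply the stored correction
        t = s + y          # this addition may lose low-order bits of y
        c = (t - s) - y    # recover (negated) what was just lost
        s = t
    return s

data = [0.1] * 1_000_000
naive = sum(data)          # accumulates rounding error term by term
compensated = kahan_sum(data)
```

In parallel settings the same idea matters twice over: partial sums are combined in a nondeterministic order, so compensated or reproducible summation is often needed to get bitwise-stable results.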
Deploying deep neural networks (DNNs) on resource-constrained devices often requires hybrid edge–cloud inference, where both end-to-end latency and communication overhead must be optimized jointly. Existing Dynamic Split Computing (DSC)... more
The Single-Chip Cloud Computer (SCC) is an experimental processor created by Intel Labs. It comprises 48 Intel-IA32 cores linked by an on-chip high performance mesh network, as well as four DDR3 memory controllers to access an off-chip... more
Partitioning a set of N patterns in a d-dimensional metric space into K clusters, in a way that those in a given cluster are more similar to each other than to the rest, is a problem of interest in astrophysics, image analysis, and other... more
A concurrent FIFO queue is a widely used fundamental data structure for parallelizing software. In this letter, we introduce a novel concurrent FIFO queue algorithm for multicore architecture. We achieve better scalability by reducing... more
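The stdlib building block for this pattern in Python is the lock-based `queue.Queue`; the letter's contribution is a less lock-heavy algorithm, which this sketch does not reproduce. It only shows the FIFO contract that any such queue must preserve: a producer and a consumer running concurrently, with a sentinel marking the end of the stream.

```python
import queue
import threading

def producer(q, items):
    for x in items:
        q.put(x)           # enqueue; put/get are thread-safe
    q.put(None)            # sentinel: no more items

def consumer(q, out):
    while True:
        x = q.get()        # blocks until an item is available
        if x is None:
            break
        out.append(x)

q = queue.Queue()
out = []
t1 = threading.Thread(target=producer, args=(q, list(range(100))))
t2 = threading.Thread(target=consumer, args=(q, out))
t1.start(); t2.start()
t1.join(); t2.join()
```

With a single producer and single consumer, items are dequeued in exactly the order they were enqueued; scalability work like the cited letter targets the contended multi-producer/multi-consumer case.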
Cache Hierarchy for 100 On-Chip Cores, by Mohamed Zahran, Department of Electrical Engineering, City University of New York (mzahran@ccny.cuny.edu). The increase in the number of on-chip cores, as well as the sophistication of each... more
It has long been recognized that many direct parallel tridiagonal solvers are only efficient for solving a single tridiagonal equation of large sizes, and they become inefficient when naively used in a three-dimensional ADI solver. In... more
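For context, the standard direct method for a single tridiagonal system is the Thomas algorithm, sketched below (a textbook algorithm, not the cited paper's solver). Its forward sweep is a sequential recurrence, each step depending on the previous one, which is exactly why direct tridiagonal solvers are hard to parallelize naively and why ADI-style solvers batch many independent systems instead.

```python
def thomas_solve(a, b, c, d):
    """Solve a tridiagonal system Ax = d, where
    a = sub-diagonal (a[0] unused), b = diagonal,
    c = super-diagonal (c[-1] unused), d = right-hand side."""
    n = len(b)
    cp = [0.0] * n                      # modified super-diagonal
    dp = [0.0] * n                      # modified right-hand side
    cp[0] = c[0] / b[0]
    dp[0] = d[0] / b[0]
    for i in range(1, n):               # forward sweep: sequential recurrence
        m = b[i] - a[i] * cp[i - 1]
        cp[i] = c[i] / m if i < n - 1 else 0.0
        dp[i] = (d[i] - a[i] * dp[i - 1]) / m
    x = [0.0] * n
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):      # back substitution
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x
```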