Zujie Ren

Zujie Ren(任祖杰)

Zujie Ren@Copenhagen, Denmark

Birth: 1984/10
Hometown: Ganzhou, Jiangxi Province, China
Graduate: Zhejiang University
Degree: Ph.D. in Computer Science
Title: Associate Professor
Favorite Proverb: Yesterday is history, tomorrow is a mystery, but today is a gift; that is why it is called the present.
Email: renzju@gmail.com (Recommended)
renzj@hdu.edu.cn (NOT recommended; messages to this address are sometimes lost)
Mail Address: College of Computer Science and Technology, Hangzhou Dianzi University, XiaSha High Education District, Hangzhou, Zhejiang, China 310018

Brief Bio

Zujie Ren is an associate professor in the College of Computer Science and Technology at Hangzhou Dianzi University, where he currently works at the university's cloud computing research institute. He studied in the Database Lab in Computer Science at Zhejiang University and received his Ph.D. degree in Sep. 2010. Before joining the faculty in Oct. 2010, he worked as an intern at NetEase Research Hangzhou for more than three years. From Sep. 2011 to Jun. 2012, he worked with the Data Platform Team at Alibaba Inc., focusing on workload characterization and performance optimization for the Taobao Hadoop cluster (called Yunti 云梯). Since May 2013, he has been working with the Apsara (飞天) Team at Aliyun, Inc., focusing on log analysis for Apsara, a distributed computing platform developed by Aliyun.

Zujie's research interests include workload analysis and performance optimization of cloud computing or other distributed systems.

Research Fields

Massive data processing

With the rapid growth of data volume in many enterprises' IT infrastructure, large-scale data processing has become an urgent challenge and has received a great deal of attention from both academia and industry. The MapReduce framework, proposed by Google, provides a highly scalable solution for data processing. The fundamental idea of MapReduce is to distribute data among a large number of nodes and process the data in parallel. Hadoop, an open-source implementation of the MapReduce framework, can easily scale out to thousands of nodes and work with petabytes of data. Owing to its high scalability and performance, Hadoop has gained much popularity and wide usage. Many companies, such as Yahoo and Facebook, as well as research groups, use Hadoop to run their data-intensive jobs, such as click-log analysis, web crawling and data mining.
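
The map, shuffle and reduce steps described above can be illustrated with a minimal word-count sketch. It is a single-machine toy, not Hadoop code: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the use of a thread pool to stand in for cluster nodes are assumptions made only for illustration.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_phase(chunk):
    """Emit (word, 1) pairs for one input split (hypothetical helper)."""
    return [(word.lower(), 1) for line in chunk for word in line.split()]


def shuffle(mapped_outputs):
    """Group intermediate values by key, as a framework does between phases."""
    groups = defaultdict(list)
    for pairs in mapped_outputs:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values emitted for one key."""
    return key, sum(values)


def word_count(lines, num_splits=2):
    # Split the input round-robin; each split stands in for the data on one node.
    splits = [lines[i::num_splits] for i in range(num_splits)]
    # Run the map tasks in parallel (threads here, machines in a real cluster).
    with ThreadPoolExecutor(max_workers=num_splits) as pool:
        mapped = list(pool.map(map_phase, splits))
    groups = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in groups.items())
```

Calling `word_count(["a b a", "b c"])` processes the two lines as separate splits and merges the per-split counts into one dictionary.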

Our work was originally motivated by the Hadoop cluster at Taobao. Taobao is a leading online e-commerce company in China with a share of more than 70%. To provide large-scale data processing services to other companies in the group, Taobao built a giant data warehouse on Hadoop. System logs, crawled pages and replicas of online relational databases (Oracle or MySQL) are gathered in this cluster continuously, where they are used for numerous applications, including traffic statistics, product sales trends and recommender systems. This data warehouse runs on more than 2,000 nodes and stores more than 20PB of compressed data, which is growing at a rate of 20TB per day. Besides production jobs that must run periodically, there are many temporary jobs, ranging from multi-hour collaborative filtering computations to ad-hoc queries that take several seconds. Most of the production jobs run automatically around midnight and must finish before users start work in the morning. Most of the temporary jobs, in contrast, are submitted by Taobao engineers during working hours and run immediately. More than forty thousand jobs run in the Hadoop cluster per day, submitted by more than five hundred users.

WaxElephant
A Realistic Hadoop Simulator for Parameters Tuning and Scalability Analysis

Ankus
Ankus: E-Commerce Workload Synthesization

Resource Scheduling in Data-Centric Systems

With the explosive growth of data volumes, more and more organizations build large-scale data-centric systems (DCS), which serve as infrastructure for various applications involving "big data". As data-centric systems continue to grow, so does the need for effective resource scheduling. Most data-centric systems, equipped with limited resources, need to serve multiple users and execute various workloads simultaneously. Resource scheduling is essential for improving system throughput and user response time. However, the resource scheduling problem in data-centric systems is challenging due to system complexity and workload diversity. To date, various scheduling techniques have been proposed and applied in different kinds of data-centric systems, such as cloud computing platforms, HPC clusters and MapReduce-style systems. An overview of the current research in this field would therefore benefit researchers greatly. We aim to give a comprehensive survey of existing research advances, which is helpful for optimizing resource scheduling techniques. We categorize the resource scheduling approaches into three groups according to the scheduling model: resource provision, job scheduling and data scheduling. We give a systematic review of the most significant techniques in each group and present some open problems yet to be addressed. Then, we discuss four case studies, each of which is carefully chosen from practical or production systems. Finally, we outline some open problems and challenges within the area of resource scheduling. We believe this systematic and comprehensive analysis will provide much greater insight into the existing scheduling techniques and inspire new developments within this field.

Log analysis and Problem Diagnosis on Large-scale Systems

System logs provide a rich information source for failure detection, failure prediction and root cause diagnosis. However, with the rapid growth of system scale and the popularity of various applications in production environments, the volume of logs generated per day has become huge, which makes it challenging to collect, store, analyze and manage them. Log filtering technology has been widely used in system log analysis and handling. Existing research can be roughly divided into instance-based methods and feature-based methods. The instance-based approach is generally used to identify instances containing abnormal information and to delete instances with redundant information. To address these challenges, we focus on developing an online log filtering mechanism that eliminates redundant and noisy log records through event filtering and instance filtering, aiming to minimize the log size without losing important information required for fault diagnosis.
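
As a rough illustration of how a two-stage online filter of this kind might look, the sketch below drops records whose event type is on a noise list (event filtering) and then suppresses repeated instances of the same event within a time window (instance filtering). This is not the mechanism developed in this work; the record layout, the `noisy_events` set and the 60-second window are assumptions chosen for the example.

```python
def filter_logs(records, noisy_events, window_seconds=60):
    """Filter time-sorted records of the form (timestamp_seconds, event_id, message).

    Stage 1 (event filtering): drop every record whose event_id is in noisy_events.
    Stage 2 (instance filtering): keep a record only if the same event_id was not
    already kept within the previous window_seconds.
    """
    last_kept = {}  # event_id -> timestamp of the most recently kept instance
    kept = []
    for timestamp, event_id, message in records:
        if event_id in noisy_events:
            continue  # whole event type is considered noise
        previous = last_kept.get(event_id)
        if previous is not None and timestamp - previous < window_seconds:
            continue  # redundant instance of a recently kept event
        last_kept[event_id] = timestamp
        kept.append((timestamp, event_id, message))
    return kept
```

Because the filter keeps only one timestamp per event type, it can run online over a stream without buffering the full log.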

Benchmarking for Cloud OS

Over the past few years, cloud file systems such as Google File System (GFS) and Hadoop Distributed File System (HDFS) have received significant research effort aimed at optimizing their mechanisms and implementations. A common issue in these optimization efforts is performance benchmarking. However, many system researchers and engineers find it difficult to build a benchmark that reflects real-life workloads, because of system complexity and the vagueness of I/O workload characteristics. They can easily make incorrect assumptions about their systems and workloads, leading to benchmark results that do not match reality.

As a preliminary step toward a realistic benchmark, we explore the characteristics of data and I/O workloads in a production environment. We collected a two-week I/O workload trace from a 2,500-node production cluster, which is one of the largest cloud platforms in Asia. This cloud platform provides two public cloud services: a data storage service and a data processing service. We analyze the commonalities and differences between the two services from multiple perspectives, including request arrival pattern, request size and data population. Key observations include that the request arrival rate follows a log-normal distribution rather than a Poisson distribution, that request arrivals exhibit multiple periodicities, and that cloud file systems fit a partly-open model rather than a completely open model. Based on the comparative analysis, we derive implications that can guide system researchers and engineers in building realistic benchmarks for their own systems. We also discuss several open issues and challenges in benchmarking cloud file systems.
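
To make the log-normal observation concrete, the following sketch estimates log-normal parameters from a list of measured arrival rates and then draws synthetic rates that a benchmark could replay. It is only an illustration under stated assumptions: the function names are hypothetical, the input is assumed to be one positive rate per time interval, and the maximum-likelihood fit shown here is a standard textbook estimator rather than the analysis method used on the trace.

```python
import math
import random
import statistics


def fit_lognormal(rates):
    """Maximum-likelihood estimate of (mu, sigma) for log-normally distributed rates."""
    logs = [math.log(rate) for rate in rates if rate > 0]
    # For a log-normal variable, log(rate) is normal with mean mu and std sigma.
    return statistics.fmean(logs), statistics.pstdev(logs)


def sample_rates(mu, sigma, count, seed=0):
    """Draw synthetic per-interval arrival rates from the fitted distribution."""
    rng = random.Random(seed)  # fixed seed so a benchmark run is repeatable
    return [rng.lognormvariate(mu, sigma) for _ in range(count)]
```

A benchmark driver could call `fit_lognormal` once on the observed trace and then use `sample_rates` to decide how many requests to issue in each interval.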
Porcupine
https://github.com/renzj/porcupine
iGen
https://github.com/renzj/iGen

Professional Activities

Program (Co-)Chair: MDSP 2012, MDSP 2013
Program Committee: FM-S&C'11, FM-S&C'12
Reviewer: Computer Journal, Oxford University Press

Funding

No.: 61300033
Title: Data Scheduling for Task Acceleration on Massive Data Processing
Role: Director
Fund: 270,000 RMB

No.: LQ12F02002
Title: Research on KV Storage Engine-based Metadata Server Cluster
Role: Director
Fund: 50,000 RMB

No.: Y18F020054
Title: Research on LSM-Tree Model Based Metadata Management Techniques
Role: Director
Fund: 100,000 RMB

No.: 2018C01098
Title: Distributed Platform for Integrated Batch and Streaming Computing
Role: Director
Fund: 500,000 RMB

Selected Publications

Zujie's DBLP entry is here.

Journal Articles:


Conference Papers:

Conference talks:

Invited Book Chapter:

Teaching

Technical Blogs (in Chinese)

Hadoop
Hive
HBase
