Big Data

From CompBio
Introduction

This page is dedicated to working with 'big data.'

Many books ship with CDs; it might be interesting to include Cloudera's virtual machine.

As I write other articles that are relevant to this one, I'll link to them here.

What Hadoop is

Batch processing system

What Hadoop isn't

Realtime platform

HDFS

HDFS, the Hadoop Distributed File System, is at the core of Hadoop.

Introduction to other Apache projects born out of Hadoop (HBase, Hive, Avro)

OK, now what?

Setting proper expectations

Computing Models: Single threaded, Distributed and Parallel

Clouds: private/public

Challenges

Searching

Indexing

Installing Packages

Hadoop

Configuration settings

HBase

WAL settings

Processing Strategies

Identifying embarrassingly parallel work

Extraction

Input record readers/Scanners

Transforming

Loading

Hive

HBase

Sqoop

sensei/kata

Lucene

Input record writers/PrintStreams

Common Approaches/Pitfalls

MySQL indexing

Single Threaded Processing Idioms

Grep, a utility of the past. Meet HGrep

Migration Tasks

Streaming

Algorithm Migration

Mathematical Averages

Examples of algorithms in both models
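
As a placeholder for this section, here is a rough sketch of how an arithmetic mean might look in both models. It is plain Python rather than Hadoop code, and every function name in it is made up for illustration. The point it tries to show: in the distributed model each mapper should emit a (sum, count) pair, because averaging the per-chunk averages gives the wrong answer when chunks differ in size.

```python
from functools import reduce


def mean_single_threaded(values):
    """Single-threaded idiom: one pass, one running total."""
    total = 0.0
    count = 0
    for v in values:
        total += v
        count += 1
    return total / count


def map_partial(chunk):
    """'Map' step (hypothetical name): summarize one input split as (sum, count)."""
    return (sum(chunk), len(chunk))


def combine(a, b):
    """'Reduce' step: merging (sum, count) pairs is associative and commutative."""
    return (a[0] + b[0], a[1] + b[1])


def mean_map_reduce(chunks):
    """Mean computed from per-chunk partial results, as a reducer would."""
    partials = [map_partial(c) for c in chunks]
    total, count = reduce(combine, partials, (0.0, 0))
    return total / count


def naive_mean_of_means(chunks):
    """Pitfall: averaging the averages ignores how large each chunk is."""
    means = [sum(c) / len(c) for c in chunks]
    return sum(means) / len(means)


chunks = [[1, 2, 3], [10]]  # two unevenly sized "input splits"
print(mean_map_reduce(chunks))      # correct mean of all four values
print(naive_mean_of_means(chunks))  # differs, because the chunks are uneven
```

Because the merge step only adds pairs together, the same function could in principle also serve as a combiner, though that is an assumption about how the eventual Hadoop job would be written.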

Understanding Comparables

Partitioners

Total order
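
Until this section is written, a minimal sketch of the idea behind total-order partitioning, in plain Python and not Hadoop's own implementation. The split points are assumed to be given (in practice they would probably come from sampling the keys), and the function names are hypothetical. Keys are routed to partitions by range, each partition is sorted on its own, and concatenating the partitions in order yields a globally sorted result.

```python
from bisect import bisect_right


def range_partition(key, split_points):
    """Return the partition index for key, given sorted split points.

    Keys below split_points[0] go to partition 0, keys from split_points[0]
    up to (but not including) split_points[1] go to partition 1, and so on.
    """
    return bisect_right(split_points, key)


def total_order_sort(keys, split_points):
    """Partition by range, then sort each partition independently."""
    partitions = [[] for _ in range(len(split_points) + 1)]
    for k in keys:
        partitions[range_partition(k, split_points)].append(k)
    # Each partition stands in for one reducer's sorted output.
    return [sorted(p) for p in partitions]


keys = ["pear", "apple", "zebra", "mango", "fig"]
splits = ["g", "n"]  # assumed split points, chosen by hand for this example
parts = total_order_sort(keys, splits)
flattened = [k for p in parts for k in p]
print(parts)
print(flattened == sorted(keys))
```

A hash-based partitioner would spread keys evenly but give no ordering between partitions, which is why a range-based scheme is sketched here.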

Quality Control

MapReduce Unit tests

Customizing your environment

Development Environment

Virtual Machines

Test clusters

Optimizing Configuration

User land settings (limits.conf)

Kernel settings

Cluster Sizing

Commodity vs Brand name

Monitoring

Network Monitoring

Performance Monitoring

Best Practices

QA

Distributed Cache

Source Code (not sure about this section)

git

svn

Tutorials

examples.jar

Security
