High scalability

Loc 2695

Obstacles:

reads
writes
data size
data complexity
response time
access patterns

Instruments to solve:

NoSQL
message queues
caches
search indexes
batch and stream processing frameworks

Load

Checking what happens when increasing load:

increase the load while keeping system resources unchanged, and see how the system's performance is affected
increase the load and see how many resources you need to add to keep the system's performance unchanged

Describing performance:

throughput (the number of records or requests processed per unit of time)
response time (the time between sending a request and receiving the response)
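
Both metrics can be measured around the same loop. The following is a minimal sketch, assuming a hypothetical `handle_request` function that stands in for real work; the payload size and request count are arbitrary.

```python
import time
from statistics import median

def handle_request(payload):
    """Hypothetical request handler; stands in for real work."""
    return sum(range(payload))

def measure(handler, payloads):
    """Return (throughput in requests/s, list of response times in seconds)."""
    response_times = []
    started = time.perf_counter()
    for payload in payloads:
        sent = time.perf_counter()
        handler(payload)
        # Response time: from sending the request to receiving the result.
        response_times.append(time.perf_counter() - sent)
    elapsed = time.perf_counter() - started
    # Throughput: requests completed per second over the whole run.
    return len(payloads) / elapsed, response_times

throughput, response_times = measure(handle_request, [10_000] * 200)
print(f"throughput: {throughput:.0f} requests/s")
print(f"median response time: {median(response_times) * 1000:.3f} ms")
```

The numbers depend on the machine, so treat them as relative rather than absolute.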

CPU

Many applications today are data-intensive. Raw CPU power is rarely a limiting factor.

Queues

They help handle load spikes, scale horizontally, and make the system more reliable.

See Feeds.
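
A rough in-process sketch of the idea, using Python's standard `queue` and `threading` modules: the producer enqueues a spike of tasks without waiting, and a pool of workers drains the queue at its own pace. The task body (`task * 2`) and the worker count are placeholders, and a real system would more likely use an external message broker.

```python
import queue
import threading

def worker(tasks, results, lock):
    """Take tasks off the queue until a None sentinel arrives."""
    while True:
        task = tasks.get()
        if task is None:  # sentinel: no more work
            break
        outcome = task * 2  # placeholder for real work
        with lock:
            results.append(outcome)

def process_spike(n_tasks, n_workers):
    tasks = queue.Queue()
    results = []
    lock = threading.Lock()
    workers = [
        threading.Thread(target=worker, args=(tasks, results, lock))
        for _ in range(n_workers)
    ]
    for w in workers:
        w.start()
    for i in range(n_tasks):
        tasks.put(i)  # the producer does not wait for processing
    for _ in workers:
        tasks.put(None)  # one sentinel per worker
    for w in workers:
        w.join()
    return results

results = process_spike(1000, 4)
print(len(results))
```

Adding workers (here, threads; in production, machines) is how the queue supports horizontal scaling.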

Big data?

Big data - when the amount of data, or the resources needed to process it, is the current limit of the system.

Throughput:

low = <100/s
medium = 100 to 5000/s
high = >5000/s

Numbers:

Airbnb: 100k messages sent on mobile per hour
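
To compare a per-hour figure with the per-second thresholds above, convert it first. A small sketch; the function names are made up for illustration, and treating exactly 5000/s as "high" is an assumption, since the notes leave that boundary open.

```python
def per_second(count, period_seconds):
    """Convert a count over a period into a per-second rate."""
    return count / period_seconds

def classify(rate_per_second):
    """Map a rate to the low/medium/high buckets from these notes."""
    if rate_per_second < 100:
        return "low"
    if rate_per_second < 5000:
        return "medium"
    return "high"  # assumption: exactly 5000/s counts as high

airbnb_rate = per_second(100_000, 3600)
print(f"{airbnb_rate:.1f}/s -> {classify(airbnb_rate)}")
# prints "27.8/s -> low"
```

So 100k messages per hour averages under 30 per second, which falls in the "low" bucket, although peak rates may be much higher than the average.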

High Scalability: Building bigger, faster, more reliable websites
Data Pipeline Architect - Resources to help you with data planning and plumbing
Why You Shouldn’t Build Your Own Data Pipeline
Spark talk on PyCon Ukraine 2017 by Taras Lehinevych

Vocabulary

Data-intensive applications

The limiting factors are the amount of data, the complexity of the data, and the speed at which it changes.

Data warehouse

In the late 1980s and early 1990s there was a trend towards using a separate database for analytics. Analytic queries are safe to run there; run against the main database, they often harm the performance of concurrently executing transactions.

A related concept is the data lake.

Compute-intensive application

Where CPU cycles are the bottleneck.

Stream processing

Send a message to another process, to be handled asynchronously.

Batch processing

Periodically crunch a large amount of accumulated data.
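
As a toy illustration, a batch job takes everything accumulated since the last run and processes it in one pass. The event records and the `run_batch` name below are invented for the example.

```python
from collections import Counter

def run_batch(events):
    """Crunch everything accumulated so far in a single pass."""
    return Counter(event["user"] for event in events)

# Hypothetical events accumulated since the previous run.
accumulated = [
    {"user": "alice", "action": "view"},
    {"user": "bob", "action": "view"},
    {"user": "alice", "action": "click"},
]

report = run_batch(accumulated)
print(report.most_common())
# prints [('alice', 2), ('bob', 1)]
```

In practice a scheduler would trigger such a job periodically (for example nightly), and a framework would split the input across machines.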

ETL

Extract-Transform-Load - a process of getting data into a data warehouse.
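
The three steps can be sketched with the standard library alone. In this sketch an in-memory SQLite database plays the role of the warehouse, and the CSV content, table name, and cleaning rules are all made up for illustration.

```python
import csv
import io
import sqlite3

# Hypothetical export from an operational system.
SOURCE_CSV = """order_id,amount,currency
1,10.50,usd
2,3.00,USD
3,,usd
"""

def extract(text):
    """Extract: read raw rows from the source."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: drop incomplete rows, fix types, normalise currency."""
    cleaned = []
    for row in rows:
        if not row["amount"]:
            continue  # skip rows with a missing amount
        cleaned.append(
            (int(row["order_id"]), float(row["amount"]), row["currency"].upper())
        )
    return cleaned

def load(rows, connection):
    """Load: write the cleaned rows into the warehouse table."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER, amount REAL, currency TEXT)"
    )
    connection.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    connection.commit()

warehouse = sqlite3.connect(":memory:")
load(transform(extract(SOURCE_CSV)), warehouse)
total = warehouse.execute("SELECT COUNT(*), SUM(amount) FROM orders").fetchone()
print(total)
# prints (2, 13.5)
```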

Reliability

The system should continue to work correctly even in the face of adversity (hardware or software faults).

Scalability

As the system grows, there should be reasonable ways of dealing with that growth.

Vertical scaling (scaling up) - moving to a more powerful machine. Horizontal scaling (scaling out) - distributing the load across multiple machines.
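
One simple way to distribute load when scaling out is to hash a request key and take it modulo the number of machines. This is only a sketch: the machine names and keys are hypothetical, and the scheme is one option among several.

```python
import zlib
from collections import Counter

def pick_machine(key, machines):
    """Map a request key to one machine using a stable hash."""
    return machines[zlib.crc32(key.encode("utf-8")) % len(machines)]

machines = ["node-1", "node-2", "node-3"]  # hypothetical machine names

# Count how many of 9000 hypothetical user keys land on each machine.
load = Counter(pick_machine(f"user-{i}", machines) for i in range(9000))
print(load)
```

The same key always goes to the same machine, which helps with caching. The drawback of plain modulo hashing is that changing the number of machines remaps most keys.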

Be pragmatic

Using several fairly powerful machines can still be simpler and cheaper than a large number of small virtual machines.
