Tuesday, April 17, 2007

Millicomputer Based Load Balancers

If we build systems that contain hundreds of modules for web-based applications, we need a way to distribute incoming network traffic across them. Commercial load balancers cost more than the millicomputing modules we want to send the load to, so I've been looking around for open source projects that implement various kinds of load balancing. I found a very good, detailed summary article on this subject by Willy Tarreau, the author of HAProxy, which he describes as:
HAProxy is a free, very fast and reliable solution offering high availability, load balancing, and proxying for TCP and HTTP-based applications. It is particularly suited for web sites crawling under very high loads while needing persistence or Layer7 processing. Supporting tens of thousands of connections is clearly realistic with today's hardware. Its mode of operation makes its integration into existing architectures very easy and riskless, while still offering the possibility not to expose fragile web servers to the Net...
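I haven't tried it yet, but to give a feel for the configuration effort, here is a minimal sketch of an HAProxy setup that round-robins HTTP requests across two millicomputer modules; the module names, addresses and timeouts are invented for illustration:

# haproxy.cfg - minimal round-robin sketch across two modules
global
    daemon
    maxconn 4096
listen webfarm 0.0.0.0:80
    mode http
    balance roundrobin
    contimeout 5000
    clitimeout 50000
    srvtimeout 50000
    # "check" makes HAProxy health-check each module and skip dead ones
    server mod1 10.0.0.11:80 check
    server mod2 10.0.0.12:80 check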

At the HTTP/application level, I found a description of a simple but powerful tool called balance.
Balance is our surprisingly successful load balancing solution being a simple but powerful generic tcp proxy with round robin load balancing and failover mechanisms. Its behaviour can be controlled at runtime using a simple command line syntax.
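Assuming two modules at made-up addresses, the whole setup should amount to a single command along these lines, listening on port 80 and round-robining incoming TCP connections across the two modules (again, I haven't tried it myself):

$ balance 80 10.0.0.11 10.0.0.12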
Another HTTP load balancer that claims high performance and more features is XLB. Its description states:
XLB is a high performance HTTP load balancer. connection management, caching, ssl, scripting. 300 mbit/sec / 4000 reqs/sec takes 30% cpu on a 2GhZ Xeon. connection pooling to backend servers reduces memory and cpu usage on backends.
One problem with load balancers is that if one fails, a potentially large number of modules would be out of action. The Ultra Monkey load balancer addresses this issue.
Ultra Monkey 3 makes use of The Linux Virtual Server (LVS) to provide fast load balancing. The Linux-HA framework is used to monitor the linux-directors - the hosts running LVS and doing the load balancing. This is combined with ldirectord which monitors real-server - the hosts that accept end-user's connections. These three core components allow Ultra Monkey 3 to provide highly available and/or load balanced network services.
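Under the covers the balancing itself is done by LVS in the kernel, and ldirectord maintains the LVS tables based on its health checks. A hand-built equivalent of what it sets up would look roughly like this, with an invented virtual address and NAT forwarding to two modules:

ipvsadm -A -t 192.168.1.100:80 -s rr
ipvsadm -a -t 192.168.1.100:80 -r 10.0.0.11:80 -m
ipvsadm -a -t 192.168.1.100:80 -r 10.0.0.12:80 -m

The first line creates the virtual HTTP service with round-robin scheduling, and the other two add the modules as real servers; Linux-HA then takes care of failing the director role over to a standby host if the director itself dies.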
I haven't used any of these options, so I'm very interested in recommendations; please comment if you have experience or alternatives to share, and I'll update this post.

In the array-of-modules scenario, I would dedicate a few modules to providing load balancing services. If the modules are all connected via Ethernet, then any module can be used. If we use the USB network, then the central USB master that provides the Ethernet gateway is the natural place to install load balancer services.

Gumstix initial bringup

I bought a Gumstix Verdex GS-270, along with a small motherboard that has serial, USB and power connectors. For initial bringup I also installed Ubuntu 6.10 Linux on a PC to act as my development host. I've figured out how to log in to Linux on the Gumstix, and I'm documenting it step by step here.

This was my goal!
# uname -a
Linux gumstix 2.6.18gum #1 Wed Feb 28 18:05:43 PST 2007 armv5tel unknown
The basic sequence included getting at the serial port on the Dell, configuring it correctly, and figuring out which of the two serial ports on the motherboard has the console output.

  1. Download the 600MB Ubuntu CD image - I did this on OSX, used Disk Utility to burn it to a CD, and installed it on the PC; quite straightforward.
  2. Ubuntu doesn't include serial comms tools by default. I ran aptitude to search for programs, found minicom and cu in the comms section, and installed both of them.
  3. The Gumstix Wiki eventually revealed these setup instructions, which are to use minicom, turn off hardware and software flow control, and set 115200-8-N-1 mode (the exact settings are shown after this list).
  4. This picture shows a similar motherboard, with the console port connected to the second serial port, which also worked for me.
  5. I connected the serial and USB cables, plugged in the power supply, and a small green LED glowed on the motherboard - nice confirmation that it's on.
  6. After watching various boot messages, I logged in as root, with the initial password gumstix.
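For reference, the minicom settings that eventually worked for me are summarized below; /dev/ttyS0 is the first serial port on the Dell, so adjust the device name if your setup differs:

$ sudo aptitude install minicom cu
$ sudo minicom -s
(choose "Serial port setup", then set)
A - Serial Device            : /dev/ttyS0
E - Bps/Par/Bits             : 115200 8N1
F - Hardware Flow Control    : No
G - Software Flow Control    : No
(save the setup as dfl, then exit to connect)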
Sounds simple, but as usual, this took quite a while for me to figure out from scratch.... There are also instructions on how to interface and develop using Windows or OSX, but I wanted to do some comparative benchmarking of PC vs. ARM running similar releases of Linux 2.6.

Configuration messages at boot:

U-Boot 1.1.4 (Mar 1 2007 - 17:10:55) - PXA270@600 MHz - 1321

*** Welcome to Gumstix ***

U-Boot code: A3F00000 -> A3F25850 BSS: -> A3F5AE70
RAM Configuration:
Bank #0: a0000000 128 MB
Flash: 32 MB

.... some more messages then:

Linux version 2.6.18gum (craig@azazel) (gcc version 4.1.1) #1 Wed Feb 28 18:05:7
CPU: XScale-PXA270 [69054117] revision 7 (ARMv5TE), cr=0000397f
Machine: The Gumstix Platform
Memory policy: ECC disabled, Data cache writeback
Run Mode clock: 208.00MHz (*16)
Turbo Mode clock: 624.00MHz (*3.0, active)
Memory clock: 104.00MHz (/2)
System bus clock: 104.00MHz
CPU0: D VIVT undefined 5 cache
CPU0: I cache: 32768 bytes, associativity 32, 32 byte lines, 32 sets
CPU0: D cache: 32768 bytes, associativity 32, 32 byte lines, 32 sets
Interesting information on 32-way cache associativity, which I did not see mentioned in the specs.

The 32MB flash memory is mounted as a filesystem, with 8MB taken up by the default installation.

# df
Filesystem Size Used Available Use% Mounted on
/dev/mtdblock1 31.8M 8.0M 23.8M 25% /
The system supports IP networking over USB, which I have plugged in but don't have working yet (it's supposed to come up automatically, but doesn't). That's next.
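When I get to it, the plan is roughly as follows; the usb0 interfaces should be created by the usbnet driver on the Ubuntu host and the Ethernet gadget driver on the Gumstix, and the 10.0.0.x addresses are just placeholders (the stock image may well have its own defaults):

On the Ubuntu host, once a usb0 interface appears:
$ sudo ifconfig usb0 10.0.0.2 netmask 255.255.255.0 up
On the Gumstix, over the serial console:
# ifconfig usb0 10.0.0.1 netmask 255.255.255.0 up
# ping 10.0.0.2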

Thursday, April 12, 2007

Vertical and Horizontal Module Arrangements

Modules are available with edge connectors that can be mounted in bulk on a motherboard as shown in the image below. The dimensions match the standard motherboard found in 1U enterprise server designs, about 12x13 inches. The diagram shows 120 modules, but it's quite likely that they could be packed in more densely than this.

The alternative is to mount the modules flat on the board as shown in the second diagram. This has the same 12x13 inch area, but is a very thin board, and at least four of them could be stacked in a 1U package, which also comes out to 120 modules.

In practice these board sizes and layouts will need to be adjusted to take into account the mechanical problems of flexing, mounting, cable routing etc. In each case the power and cooling management should be relatively simple, since there is a total peak power of around 100 watts for the entire 1U package, and no localized hot spots.

Some of the module designs have built-in temperature sensors and they all have power voltage sensors, so they can detect and report on environmental conditions across the motherboard.

Wednesday, April 11, 2007

Millicomputer Module Interconnects

There are two basic approaches.

One is to get modules that have Ethernet built in (or to add Ethernet interfaces to a motherboard) and use Ethernet switch chips, such as the 8-24 port solutions from Vitesse, to cluster the modules together. The individual modules would connect at 100Mbit, and the switches and external interfaces would interconnect at 1Gbit. The single chip Ethernet switches have lots of features but can be run as unmanaged devices, so there is very little software needed to implement or manage the network. By directly connecting the networks on a motherboard there is no need to drive the full physical Ethernet wire standard between the devices, saving a lot of power. These devices cost a few dollars a port, and dissipate about half a watt per port for fully driven gigabit links. If we can avoid using the Ethernet "PHY" (physical driver), a lot more power can be saved.
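To put that in perspective: at half a watt per fully driven port, a switch fabric serving 120 modules would burn tens of watts on the PHYs alone, a large fraction of the roughly 100 watt budget for the whole 1U package, which is why dropping the PHY between chips that share a board matters so much.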

Another option is to use the built-in high speed USB 2.0 interfaces, which run at up to 480Mbit/s, connect them to a USB-based central router that has Ethernet support, and run IP over USB. This is a bit more complex to implement, but could be faster, lower power and cheaper, since it uses an interface that is built directly into the millicomputer CPU. There are other kinds of devices, like the AMCC PPC440EPx, that are more PC-like and have Ethernet, PCI bus and high speed USB built in, which could be used to implement a board level controller/router/interface. This device is more powerful than the mobile-oriented millicomputer CPUs but dissipates about 3W, so it's in the next bracket up from a power consumption viewpoint.

PXA270 Module for testing

I just ordered a Gumstix GS270-XL6P module with a 600MHz PXA270 and 128MB RAM. I'll run benchmarks on it, then build it into one of the mobile phone designs I'm working on. More later...

Thursday, April 5, 2007

Millicomputer Module Specifications

Here is another Google spreadsheet table of millicomputing module specifications.

There are several approaches, but some of these are edge-connector based, include on-board Ethernet, and could be stacked on a motherboard in a very dense array.

I think that if we could get five rows of 24 connectors on a standard 1U motherboard, that is 120 individual modules using less than 100W maximum. The motherboard would just need to provide power and Ethernet switch chips. If we also want per-node storage, there are many very dense NAND flash chips in the multi-gigabyte range that could be added to the design.
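As a rough sanity check: 5 rows x 24 connectors = 120 modules, and a 100W limit spread over 120 modules allows a little over 0.8W peak per module, before counting the Ethernet switch chips on the motherboard.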

So is that interesting? I think so...

Tuesday, April 3, 2007

Millicomputer CPU Specifications

I've started a table of CPU specifications as a Google spreadsheet.

I'm mostly interested in the CPU clock rate, CPU caches, RAM bandwidth and size.

All these devices are very flexible, and are mostly configured with relatively small amounts of memory for embedded applications. However, they have a decently fast clock rate, and can interface to at least two SDRAM chips. These chips are 32 bits wide and currently hold 128MB each. The CPUs support up to 256MB per chip, so the next generation of SDRAM devices can double overall capacity.
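If I'm reading the specs right, that means two 128MB chips for 256MB per CPU today, going to two 256MB chips for 512MB per CPU with the next generation of parts.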

Compared to current "enterprise CPUs" they are much slower than Opterons but probably comparable to a single thread on a Niagara.

The next comparison table I want to put together is for board level devices, such as the Gumstix range.