
Re: Money, mirroring



-----BEGIN PGP SIGNED MESSAGE-----


Hey Micah (& everyone else),

I want to preface my response in hopes that people will keep this in mind
as they read:

What is the goal/purpose of this site with regard to the "big picture"?
I was asked this by VA Research and NaviSite.  My response was: 

To always be the first thing mentioned in response to the statement:  
"There is no support for Linux."

If that is true it means that the LKB is:

1) Professional looking (worthy of showing to others)
2) Useful with regard to:
	- Plenty of quality information
	- Information is easy to access (able to find what you want)
	- The site is responsive (people aren't frustrated because it is
		too slow)
3) Stable.  Its "address" doesn't change and it's always up.
4) A contributing member of the Linux community.  We strengthen the ties
	between users and other users as well as to vendors.

On Tue, 22 Dec 1998, Micah Yoder wrote:

> "Aaron D. Turner" wrote:
> > Also, from my experience, the "master" site gets 90% of the hits and the
> > mirrors get the scraps.  This is just because other webmasters are too
> > lazy to list multiple URLs for a site.  They just list the master.
> > Users are too lazy to "pick the nearest site", not to mention the fact
> > that "nearest" in Internet terms is a confusing proposition at best.
> 
> Here's what I think we should do:  have www.linuxkb.com be a "dummy"
> site with some basic information and links to mirrors, all in static
> HTML.  All mirrors should be accessible to us, and be under the
> linuxkb.org domain, and have geographic information.  Like, Navisite
> would be us-ca.linuxkb.org, Germany could have a de.linuxkb.org, Mexico
> could have mx.linuxkb.org, etc.

The problem is that geographical distance in Internet terms is a poor
measurement.  In CA, for example, it's often faster to go to a site on the
east coast than one down the street (literally) because to reach the
server down the street I have to go through MAE-West which sucks 7am to
6pm.  This scenario is played out at every public exchange (MAE/NAP). That
was a major reason for working out the deal with NaviSite. They have
purchased private DS3 and above connections to MCI/Worldcom, BBN, Sprint,
PSI, etc.
The result is that 90-95% of the internet is reachable from NaviSite
without going through a NAP or MAE.  NaviSite doesn't even connect to
public exchanges, they only have connections to tier 1 carriers.
 
> We would need an organization to give us hardware and bandwidth for each
> mirror.  I really think that would be easy to do.  Their logos would be
> displayed on their mirror.

More difficult than you think.  Hardware is relatively cheap.  Bandwidth
is expensive.  Good bandwidth is _very_ expensive.  Realize that NaviSite
is not doing this for us because they think they're going to get some
great advertising.  That's only secondary.  The primary reason is it is a
favor to me; probably because they want me to recommend that Vicinity
opens our east coast data center with them rather than Exodus.  

> Geographic diversity is our friend.  Yes, it will be a pain to set up,
> but that's what we're here for.

No.  Geographic diversity is our _ENEMY_.

1) GD (Geographic Diversity) makes distributing updates more difficult.  
More things can go wrong and prevent one or more servers from receiving
the updates.

2) GD makes directing clients to the "nearest" server much more
complicated.  How do you measure "nearest"?  Miles?  Latency?  Hop count?
How do you translate that into an easy and meaningful value to the user?

3) GD makes server redundancy difficult at best.  Even the $60K+ solutions
from Cisco and RND Networks are a "hack" at best. 

4) How do you get a user to another site when www.linuxkb.org goes down?
They're not going to remember de.linuxkb.org or il.linuxkb.org

5) What are we really trying to solve?  Geographic distribution or "load
distribution"?  These are two very different things.  GD sends traffic to
the "nearest" server.  LD sends traffic to the least utilized server.
Rarely are the two the same server.  Once you determine what the real
bottleneck is, you know what you need to solve for first.
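
A rough sketch of why point 5 matters, with made-up numbers.  The mirror
names are the ones proposed above; the latency and load figures are
invented for illustration, and the Mirror class and both pick_* functions
are hypothetical helpers, not anything we have written:

```python
from dataclasses import dataclass


@dataclass
class Mirror:
    name: str
    latency_ms: float  # round-trip time seen by one particular client
    load: float        # fraction of capacity in use, 0.0 to 1.0


def pick_nearest(mirrors):
    """GD-style choice: the lowest latency wins, load is ignored."""
    return min(mirrors, key=lambda m: m.latency_ms)


def pick_least_loaded(mirrors):
    """LD-style choice: the lowest utilization wins, distance is ignored."""
    return min(mirrors, key=lambda m: m.load)


# Invented figures for a single client.
MIRRORS = [
    Mirror("us-ca", latency_ms=40.0, load=0.95),
    Mirror("mx", latency_ms=90.0, load=0.60),
    Mirror("de", latency_ms=180.0, load=0.20),
]

if __name__ == "__main__":
    print("nearest:     ", pick_nearest(MIRRORS).name)
    print("least loaded:", pick_least_loaded(MIRRORS).name)
```

With these figures the two policies pick different boxes, which is the
whole point: you have to know which problem you are solving before you
pick a policy.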

> > The logos along with the statement were part of the deal.  When I was
> > leading our group I felt that it was important to get a dedicated box with
> > excellent connectivity.  This seemed like a good/quick/easy way to
> > accomplish that.  It was also an effective motivator for us as a group.
> 
> Yes... that kind of thing is good for us and for them.  Hopefully we can
> give them some business.  It's better than cash for us (no legal
> hassles) and good publicity for them.

Agreed it is by far better than cash.  Though the legal hassles are the
same.  I'm sure if the IRS found out, they'd tax us on the "value" of the
VAR box as a gift just like if you won a car on Wheel of Fortune.

> > the LDP!)  And even if we don't mirror/host another site, maybe we donate
> > $$$ to support a penguin at a zoo.  Maybe we pay someone to write an OSS
> > driver for USB or something.  The fact is that there's a lot of good we
> > could do with the $$$, beyond purchasing hardware for the site.
> 
> Whooaaahhh....  *That*'s going a bit beyond our "charter".  :-)  

What charter?  I haven't seen one (Well you and I did that old one way
long ago, but things have changed a lot since then.)  I'm just saying what
we can do with money; not necessarily that we should.  Maybe we should all
take an equal share of the money for the man months we're going to be
putting in. ;-)  [That's a joke.]

> Sure,
> money can be used for all kinds of good, but it was never a purpose of
> ours to make money, or even to do good.  :-)  Of course, our
> contribution will be a good place to find information on Linux.  That's
> what we're here to contribute, and I don't think we need to try to
> accomplish more.

So we should artificially limit ourselves in the amount of good we do?  If
we have the ability to do something positive for Linux, with virtually no
overhead, we should IMHO. The reason that OSS and Linux in particular have
succeeded is because people have not been willing to "let someone else do
that".  They've seen an opportunity to do some good and done it.  A
successful LKB site will do millions of page views a month- let's make it
worth something more than just helping people fix problems for which there
are existing solutions.
 
> > While I'll agree serving/mirroring the static HTML content is a breeze,
> > that is only a minor part of the overall utilization of the box.  The
> > CPU/disk hog won't be the static HTML, MySQL, Apache, or even the Perl
> > code we write, it's going to be Ht://dig.
> > 
> > Ht://dig uses a DB2 database which will be virtually unmirrorable (the DB
> > is generated via a spider, hence incremental updates aren't likely to
> > work, even if we found a way to do it with DB2).  The DB will be 4 times
> > the size of the static HTML due to the fuzzy logic extensions.  And
> > realize that 90+% of users aren't going to "browse" the site.  They're
> > going to type a key word or two and click "search".  If we're lucky, for
> > every 5 pages served, only one will be a Ht://dig results page.
> 
> I don't know exactly how Ht://dig works, but I have a hard time
> believing it's unmirrorable.  It's open source, we can hack it. 

Does that mean you volunteer?  If so great and I'll shut up.  (Its
uncompressed tarball is 9.6MB BTW... have fun. :-)  Understand that by
doing a distributed solution we are trying to make htdig do something it
was never intended to do.  It's not even designed to provide search
results on remote sites... it just doesn't scale like that.

> We
> would just need to write a program to do the updates, and run a daemon
> on the mirrors to listen for them.  It would work something like this:
> 
> When a new article is added, a new HTML page is generated, some SQL will
> happen to get it into the RDBMS, and ht://dig will need to know about

Actually the plan is that the article is inserted into the SQL DB via the
CGI, then a query is run via cron and that generates the HTML page and
updates the index.

> it.  We'd probably use a special section of the main site,
> www.linuxkb.org, to add stuff.  It would then generate all the info
> needed and send it to all the mirrors, possibly by E-mail, maybe some
> other way.  The mirrors would run the same SQL command to get it into
> their databases, and ht://dig would be notified to add the page.
> 
> Does ht://dig have a way to add a page other than a spider search?  I
> hate spiders...  (the web kind...the 8 legged kind are kinda neat)

No, it doesn't.  Getting a web site indexed is a two-step process.  First
you run htdig to scan the site (the spider) and then you run htmerge to
create the index which htsearch uses for the client queries.  For more
info, see www.htdig.org
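
As a sketch, the two steps could be driven from a small script like the one
below.  It assumes both programs take -c to name the config file; the
config path in the usage line and the two function names are made up for
illustration.

```python
import subprocess


def reindex_commands(config_path):
    """Return the two commands, in the order they must run."""
    return [
        ["htdig", "-c", config_path],    # step 1: spider the site
        ["htmerge", "-c", config_path],  # step 2: build the index htsearch reads
    ]


def reindex(config_path, run=subprocess.run):
    """Run both steps, stopping at the first failure.

    `run` defaults to subprocess.run; it can be swapped out to try the
    script on a machine without htdig installed.
    """
    for cmd in reindex_commands(config_path):
        run(cmd, check=True)
```

Something like reindex("/etc/htdig/htdig.conf") from cron would do a full
rebuild.  Note that it is a full rebuild every time, which is exactly why
pushing single-page updates out to mirrors doesn't fit.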
 
> This has *got* to be do-able.

No doubt it is.  The question is how many man weeks does it take to
implement in a way that it scales and is easy to use?  Not to mention you
don't want to break compatibility with the existing ht://dig distribution.  
Do we really want to re-invent the wheel?  I think our limited time is
better spent elsewhere on the site than hacking someone else's C code to
create a distributed htdig-- especially since none of us have any
familiarity with the C code for htdig.  It could be as simple as writing a
new subroutine or two, or as complicated as rewriting the whole thing from
scratch.

Yes, I know that no one else sees time as a scarce commodity on this
project, but the reality is we have a lot of work to do and only a few
people doing it.  If someone wants to find and manage 20 developers for
this site, then what are you waiting for?  But I can tell you this, it
will be more difficult to manage only 10 more developers than it will be
to implement banners and cash the resulting checks.  

> We really do need to decide on something soon.  We won't need mirrors
> for a while, but we do need to get the code in place to run them, so we
> can put them up easily when we need them and get donations.

Exactly.  You have to design a system with scalability in mind or it
won't scale.  It doesn't happen by accident and it's virtually impossible
to do later on as an add-on feature.

> > The only reason to mirror the site is if we can scale the search engine;
> > anything else is meaningless.  Having the boxes at a single location
> 
> We need to scale the search engine AND provide faster response for
> overseas users.

Do some network tests sometime.  Ping some machines in Japan, the UK,
Israel, and Australia.  Notice the latency.  Now consider the time to
generate a results page for a boolean/fuzzy search with htdig for 1000
articles.  What is the bottleneck?  The three second htdig query or the
500ms round trip time for a packet?  The best way to provide a faster
response for everyone is to keep the htdig queries down to a reasonable
amount of CPU time.  That means either one BIG box or a bunch of small
ones acting like one big one.
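
Back-of-the-envelope version of that argument.  The 500ms round trip and
three second query are the figures above; the two round trips per request
(connection setup plus request/response), the 50ms "nearby mirror" and the
one second "faster query" are assumptions picked for illustration, and
both function names are hypothetical.

```python
def response_time_s(rtt_ms, round_trips, query_s):
    """Total time the user waits: network round trips plus search time."""
    return round_trips * rtt_ms / 1000.0 + query_s


def query_share(rtt_ms, round_trips, query_s):
    """Fraction of the total wait that is spent inside the search engine."""
    return query_s / response_time_s(rtt_ms, round_trips, query_s)


if __name__ == "__main__":
    far = response_time_s(500, 2, 3.0)         # overseas user, slow query
    near = response_time_s(50, 2, 3.0)         # same query, nearby mirror
    far_fast = response_time_s(500, 2, 1.0)    # overseas user, faster query
    print(f"far, 3s query:  {far:.1f}s")
    print(f"near, 3s query: {near:.1f}s")
    print(f"far, 1s query:  {far_fast:.1f}s")
    print(f"query share when far: {query_share(500, 2, 3.0):.0%}")
```

Under those assumptions the overseas user waits about 4 seconds, three
quarters of it in htdig.  Putting a mirror next door only gets that down to
about 3.1 seconds, while a faster search gets it to about 2 seconds for
everyone, near or far.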
 
> > behind a Local Director which gives the illusion of one big box makes
> > scaling the system a breeze.  Otherwise you have to convince others to run
> > and maintain a CPU/disk intensive search engine.  At that point it's no
> > longer a simple mirror site, it's almost a full blown replica and that's a
> > lot harder sell to volunteers.  And then you still have the failing
> 
> I think we'd still have control of everything.  Just like the Navisite,
> VAR deal, they'd give us the box and bandwidth and we'd do the rest.  It
> would *not* share a CPU with something else...someone would need to give
> us a box whole-hog.

I doubt that we would get very many offers like that.  We might get some
old P133's with 16-32MB of RAM, but what are we going to do with those?
Once you make the requirement that it has to be a dedicated box that we
maintain 100%, 95+% of the people who were willing to mirror won't.  Even
if we do get 10 offers for free PII 450MHz boxes with 1/2 GB RAM and OC12
connectivity it still doesn't solve the fact that you have to do all this
back end design work which prevents us from working on other parts of the
site.  And then someone has to manage it all.
 
> > proposition getting people to actually use the mirrors.  The result is
> > www.linuxkb.org is a massively overloaded site in 6-12 months no matter
> > how many satellite mirrors we have.
> 
> Not if we just make it a static set of links.  :-)

And then you're doing the same thing as the LDP which has a horrible
distribution method.  Not to mention the LDP is a far less complex site
than the one we're talking about building.  The LDP has no SQL back end
and no forms for people to submit articles.  It has a monolithic search
engine which is intentionally buried deep in the site so people can't
find/use it because if they did it would quickly get overloaded.  

The LDP uses the classic OSS web server scaling model and it just plain
sucks.  Take a look at any commercial site.  Nobody does it that way.  Ask
yourself "Why?".  It's not a cost issue.  It's the engineering time to
design a system that scales that way and the administration of a
distributed solution that prevent companies from doing it.  Hence, only
companies whose lifeblood *is* the Internet have multiple Internet data
centers.  And the fact is that only a small percentage of those actually
bother because it's not an effective means to increase performance to the
majority of users.

> > One last thing.  There is an advantage with corporate sponsorships-
> > immediate visibility outside the Linux community.  VA Research for
> > example, when we go live is going to do a press release on NewsWire about
> > us.  This means that we now have visibility to traditional media and those
> > who aren't connected to the Linux community.  Think about what that means
> > in terms of not just the number of visitors, but the diversity, and how
> > that helps promote Linux and pro-Linux companies such as VA Research.
> 
> That's great!  They can do that because they're contributing hardware --
> we don't need to ask for cash.

Never said that we'd ask corporate types for money.  They wouldn't give it
anyways.  But we can sell them something they want (eyeballs) and get
money for that.

> > They can help us and we can help them and by helping each other we help
> > Linux.  As long as we keep that the purpose, money isn't a dividing issue
> > anymore, it becomes a tool.
> 
> Right, but it hasn't been proven that we need it yet.  My opinion of
> money is this:  We don't need it now, period.  We might in the future,
> and if that's the case, THEN we can go after it.  I really hope it

The problem is that when we need the money it will be too late to try to
earn the money.  We're better off with money in the bank when the rainy day
comes rather than trying to quickly earn a buck when the need comes.
I don't think we will have a rainy day any time soon, so we can probably
put it off for a little while.

> doesn't have to be with banner ads though.  That would cheapen the site
> a bit.  This really should look professional.

And an LDP distribution method looks professional?  Besides, have you
noticed that the majority of "professional/commercial" web sites have
banners?  And so do many Linux/OSS sites.  Banners have become so
prevalent that they have become the de facto standard for any kind of site.
There's little if any public backlash anymore.  And we don't have to do
banners... remember I've found two other revenue streams: books and the BF.

Talk to ya at 7/10EST.

- -- 
Aaron Turner           | Either which way, one half dozen or another. 
aturner@pobox.com      | Check out the Red Hat Linux User's FAQ Online!
www.pobox.com/~aturner | http://www.pobox.com/~aturner/RedHat-FAQ/
All emails from this account are PGP signed.  Lack of a signature is "bad".
PGP Key fingerprint = FB E1 CE ED 57 E4 AB 80  59 6E 60 BF 45 1B 20 E8



-----BEGIN PGP SIGNATURE-----
Version: 2.6.2

iQCVAwUBNoBVqDM3jpXy1kJtAQFajAP/U9dlW6CCV9Diz30IX3H/dWhXAcKkfjt8
TZO/PVd3QMwH2lPPF3eW4Oe13LaLIrK5LZyJNLGr+3cyvBPBepNNx8FkhRnJjxpr
P/JTOT0z5sC9KzHTp5My5PZ4M7jGTJ19wrCla7aIxUVdUOur3BtSq16QPSR5xAxB
H0eEkP9Jm1Y=
=W0o+
-----END PGP SIGNATURE-----