aisconvertor
[info]siberean
On SourceForge:
http://sourceforge.net/projects/aisconvert/files/

Discussions:
http://forum.molgen.org/index.php/topic,934.60.html

23andme complete list:
http://napobo3.lk.net/dna/23andmesnps.zip

HapMap:

http://hapmap.ncbi.nlm.nih.gov/downloads/genotypes/00README.txt <== format
http://hapmap.ncbi.nlm.nih.gov/downloads/genotypes/?N=D <== data

HIR search:
http://the2.3utilities.com:8080/22/

Volga
[info]siberean
http://static.panoramio.com/photos/original/762412.jpg
http://static.panoramio.com/photos/original/18119271.jpg

http://static.panoramio.com/photos/original/23916900.jpg
http://static.panoramio.com/photos/original/13434804.jpg
http://static.panoramio.com/photos/original/13471218.jpg
http://static.panoramio.com/photos/original/15702721.jpg




Lucene links
[info]siberean
CIA strategic investment into Lucene:
http://www.iqt.org/news-and-press/press-releases/2009/Lucid_Imagination_06-15-09.html

http://news.cnet.com/8301-13505_3-10288143-16.html

One of the lectures about Lucene:
http://www.fosslc.org/drupal/node/473

Lucene Wikis in general and performance in particular:
http://wiki.apache.org/lucene-java/FrontPage?action=show&redirect=FrontPageEN
http://wiki.apache.org/lucene-java/ImproveIndexingSpeed
another discussion with numbers:
http://www.gossamer-threads.com/lists/lucene/java-user/57213

tutorial
http://today.java.net/pub/a/today/2003/07/30/LuceneIntro.html

A wrapper around Lucene (for manual indexing and auto-indexing of file hierarchies):
http://sourceforge.net/projects/ais/

DataMining links:
http://wiki.apache.org/lucene-java/InformationRetrieval

Screenshots under Windows XP (Linux ones are on sourceforge):
siberean.wikispaces.com/AIS+screenshots


Associative Indexing Service
[info]siberean
AIS (Associative Indexing Service) is an application for storing bookmarks and memos and for indexing big (lifetime) archives, so that the data can be found quickly later by (personalized) keywords. In other words, it is an extension of human associative memory :)

sourceforge.net/projects/ais/

siberean.wikispaces.com/AIS+screenshots

http://freshmeat.net/projects/ais-associative-indexing-service


Grouping and storing personal information in Internet as in a global 'Cloud'
[info]siberean

Status of This Memo
 This document proposes a method for using the public Internet as a global,
 unlimited storage for personal information, and a way to group a person's
 public information and separate it from other information.
 
 
Copyright Notice
 Copyright (C) The Internet Society (2005).

         This document and translations of it may be copied and
         furnished to others, and derivative works that comment on or
         otherwise explain it or assist in its implementation may be
         prepared, copied, published and distributed, in whole or in
         part, without restriction of any kind, provided that the above
         copyright notice and this paragraph are included on all such
         copies and derivative works.  However, this document itself may
         not be modified in any way, such as by removing the copyright
         notice or references to the Internet Society or other Internet
         organizations, except as needed for the  purpose of developing
         Internet standards in which case the procedures for copyrights
         defined in the Internet Standards process must be followed, or
         as required to translate it into languages other than English.

         The limited permissions granted above are perpetual and will
         not be revoked by the Internet Society or its successors or
         assigns.

         This document and the information contained herein is provided
         on an "AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET
         ENGINEERING TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR
         IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE
         OF THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY
         IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A
         PARTICULAR PURPOSE.

 

 

 Grouping and storing personal information in Internet as in a global 'Cloud'                

V.Gavrilov                               June 2009

 
 Today's Internet (the so-called Web 2.0) is used more and more for decentralized
 storing and editing of articles, comments, blogs, bulletins and wikis, which are
 hosted on different public sites. But there
 is no standard way to extract all the information published by one user,
 to group and search such information, or to separate
 one's own information from the myriads of articles published by
 other users.
 This grouping exists only in the author's head, as personal associations
 and memory. As time passes, some publications are lost, URLs and
 domains change and bookmarks expire, so the only way to find
 one's published information is to search the Internet again and try
 to pick it out by hand from the millions of results returned by a
 search engine, narrowing the queries step by step.
 
 Proposed here is a simple method of grouping personal information: attach
 a signature to every published message, where the signature is the short
 output of a one-way hash function computed over a combination of the user's
 name, date of birth and a personalized message (the latter to avoid collisions).
 To avoid confusion with the term "digital signature" from asymmetric
 cryptography, from now on this signature/hash is called a NID (Network ID).
 
 
 A scenario could look like the following. While browsers do not yet support
 NIDs, a user can copy and paste the NID by hand into every personal post.
 Later the user can query a favourite search engine for the NID and get back
 only his own material, separated from the rest of the Internet,
 because the NID is a word generated from unique personal information and is
 highly unlikely to occur anywhere else.
 In the future, browsers or search engines could offer a "Personal/Public" checkbox;
 selecting it would extract from the Internet only the
 information the user has ever published under this NID.
 This is the simplest scenario. More complex usages are easy to show, such as
 forming multiple groups inside one personal group by
 attaching keywords to the NID.
 
 Let's consider a simple implementation of such a NID generator.
 Compactness of the NID is obviously highly desirable, and it seems
 that even the 128-bit MD5 algorithm can be used successfully for this purpose.
 (The published collision vulnerabilities of MD5 should not
 affect this particular usage: first, the key is generated from a short,
 predefined string; second, even if a collision does occur, its
 consequences are minor and not critical.) On the other hand, a
 22-character hash string (the 128-bit binary MD5 output, base64-encoded without
 the trailing "==" padding) is short enough to be stored in every post or article.
 
 Generating a personalized NID can be hypothetically demonstrated by the following
 UNIX command * (the real output will be longer than 22 characters, because the standard built-in
 md5 prints hexadecimal, which is less compact than base64):
 
 % echo "Vasili Gavrilov 01011968 my cat's name - Kuzma" | md5
 % XQgYnkgYSBwZXJzZXZlcm

 where "my cat's name - Kuzma" is a seed/salt. It is added to avoid collisions
 between people who share a name and a date of birth, and also
 to prevent another person from generating the NID and extracting somebody else's
 information (if the name, the DOB and the NID generation protocol are known).
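 As a minimal sketch of the 22-character form described above (this is not the
 reference tool and not a fixed protocol), the same seed can be hashed with MD5
 and base64-encoded in Python. The function name make_nid, the field order and
 the single-space delimiter are assumptions made only for illustration.

```python
import base64
import hashlib


def make_nid(name: str, dob: str, salt: str) -> str:
    """Return a 22-character NID: unpadded base64 of the MD5 digest."""
    # Assumption: fields are joined with one space, in this order.
    # The memo leaves the exact convention open.
    seed = " ".join([name, dob, salt])
    digest = hashlib.md5(seed.encode("utf-8")).digest()  # 16 bytes = 128 bits
    # 16 bytes encode to 24 base64 characters, the last two being '=' padding.
    return base64.b64encode(digest).decode("ascii").rstrip("=")


nid = make_nid("Vasili Gavrilov", "01011968", "my cat's name - Kuzma")
print(nid, len(nid))
```

 The standard base64 alphabet contains '+' and '/', which a search engine may
 treat as word separators, so a real convention might prefer another encoding;
 that choice is left open here.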


 Note that there should be at least a minimal protocol defining
 which fields are used, and in which sequence, to generate the NID, in order to avoid
 collisions caused by overly simple inputs to the md5+base64 combination.
 This RFC is intended to begin the discussion of this convention.
 
 For example, the protocol could require the first name, last name,
 date of birth and "salt", in this order (either in any case and with any delimiter or,
 on the contrary, with a predefined delimiter and casing: TBD. Benefits of that?).
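 One possible benefit of a predefined delimiter and casing is reproducibility:
 the same person gets the same seed, and therefore the same NID, however the
 fields happen to be typed. The sketch below illustrates this. The name
 canonical_seed, the "|" delimiter and the choice to case-fold every field
 (including the salt) are hypothetical, not part of any agreed convention.

```python
import unicodedata


def canonical_seed(first: str, last: str, dob: str, salt: str) -> str:
    """Normalize the four fields and join them in a fixed order."""
    cleaned = []
    for field in (first, last, dob, salt):
        field = unicodedata.normalize("NFC", field)  # one form per character
        field = " ".join(field.split())              # trim and collapse spaces
        # Assumption: case is ignored everywhere. Folding the salt makes it
        # easier to retype but also slightly weaker as a 'password'.
        cleaned.append(field.casefold())
    return "|".join(cleaned)


a = canonical_seed("Vasili", "Gavrilov", "01011968", "my cat's name - Kuzma")
b = canonical_seed(" vasili ", "GAVRILOV", "01011968", "My  cat's name - Kuzma")
print(a)
print(a == b)
```

 Without such a rule, the two spellings above would hash to two unrelated NIDs.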
 
 
 An extension of the protocol could be to attach a personal keyword or
 an association to the NID.
 For example, when storing something related to photos, the user could append
 "photo" to the end of the NID:
  XQgYnkgYSBwZXJzZXZlcmphoto
  
 Searching the Internet for this string later will then give the user all
 the entries ever stored with this key.
 Storing several NIDs with different keywords makes it possible to create
 arbitrary groups, and search engines will do their regular job of intersecting
 the groups.
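 The keyword extension can be sketched in a few lines. The helper names tag_nid
 and keyword_of are hypothetical, and the restriction to alphanumeric keywords
 is an assumption made so that the fused string stays a single searchable word.

```python
from typing import Optional


def tag_nid(nid: str, keyword: str) -> str:
    """Fuse a keyword onto the NID so the pair is indexed as one word."""
    if not keyword.isalnum():
        raise ValueError("keyword must be a single alphanumeric word")
    return nid + keyword


def keyword_of(token: str, nid: str) -> Optional[str]:
    """Return the keyword part of a tagged token, or None if it is not ours."""
    if token == nid or not token.startswith(nid):
        return None
    return token[len(nid):]


EXAMPLE_NID = "XQgYnkgYSBwZXJzZXZlcm"  # the illustrative NID from the memo
print(tag_nid(EXAMPLE_NID, "photo"))
print(keyword_of("XQgYnkgYSBwZXJzZXZlcmvacation", EXAMPLE_NID))
```

 Because the keyword is fused onto the NID without a separator, it can only be
 split off again by someone who already knows the NID.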
 
 Note that it will be hard for another person to get
 somebody else's information, because the hash function is irreversible and
 the 'salt' acts as a 'password'.
 Nobody is restricted to a single NID, so this is very different
 from assigning one NID to every user forever, and so it also seems to be a very
 privacy-friendly approach.
 
 In the future, browsers (or search engines) may support transparent appending of the NID
 to requests, so that searching for past personal postings connected with "photo" and
 "vacation" is done by simply entering
 "photo vacation" in the browser, just as it is done today against common data.
 With the above-mentioned "Personal" checkbox checked, the browser attaches
 the locally saved NID (or one saved in a cookie or a session) and sends a more restricted
 request, which returns only the user's personal postings (from multiple sites) containing
 both keywords "photo" and "vacation".
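 One way a browser could rewrite such a request is sketched below. It assumes
 that postings were tagged by fusing each keyword onto the NID, and that the
 search engine intersects the words of a query; personal_query is a
 hypothetical name, and a real browser might combine NID and keywords differently.

```python
def personal_query(query: str, nid: str, personal: bool = True) -> str:
    """Rewrite a keyword query so that it only matches NID-tagged postings."""
    words = query.split()
    if not personal:
        # Checkbox unchecked: search the common data as usual.
        return " ".join(words)
    # Checkbox checked: every keyword becomes NID+keyword, so the engine
    # intersects the user's own groups instead of everybody's postings.
    return " ".join(nid + word for word in words)


saved_nid = "XQgYnkgYSBwZXJzZXZlcm"  # the illustrative NID from the memo
print(personal_query("photo vacation", saved_nid))
print(personal_query("photo vacation", saved_nid, personal=False))
```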
 
 Other extensions can be imagined, such as attaching a counter or
 another id at the end, to allow saving redundant (identical) data on
 multiple sites and to make the duplicates easier to distinguish in the browser.
 This can be further elaborated.
 
 
 The procedure described above makes it possible to use the public Internet as an unlimited storage
 for personal data, with easy extraction and grouping of such data and separation
 of public data from data saved by a person. It also turns saving data
 on the Internet into saving into one global 'Cloud' and abstracts away the location
 (a URL may change, but the information remains searchable).
 This allows personal data to be distributed evenly across multiple public
 storages and, in the future, possibly to organize personal distributed services that work
 with personal data in a truly parallel way.
 
 

XQgYnkSBwZXZXZlcm

*)  A reference tool for generating such a signature is here: http://sourceforge.net/projects/nid/


Scratchpad
[info]siberean
bonnie on my P4/1400:

Version 1.03c       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
anode         1504M  5677  51 25452  31 17245  19  9652  56 49905  20 292.1   5
                    ------Sequential Create------ --------Random Create--------
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16  4147  32 32462  22  6747  31  5210  38 +++++ +++ 10965  46
anode,1504M,5677,51,25452,31,17245,19,9652,56,49905,20,292.1,5,16,4147,32,32462,22,6747,31,5210,38,+++++,+++,10965,46


bonnie on ARM (SheevaPlug):
Version 1.03c       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
debian           1G  3152  99  8606  80  6942  89  4999  99 28059  99  1490  96
                    ------Sequential Create------ --------Random Create--------
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16   562  59 +++++ +++   937  99   864  94 +++++ +++   879 100
debian,1G,3152,99,8606,80,6942,89,4999,99,28059,99,1490.4,96,16,562,59,+++++,+++,937,99,864,94,+++++,+++,879,100
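To compare the two runs side by side, the comma-separated summary lines can be parsed with a short Python sketch. The field names are my own labels derived from the column headers above, not official bonnie names; runs of '+' (which, as far as I know, bonnie prints when an operation finished too quickly to time reliably) are mapped to None.

```python
FIELDS = [
    "machine", "size",
    "seq_out_char_kps", "seq_out_char_cpu",
    "seq_out_block_kps", "seq_out_block_cpu",
    "rewrite_kps", "rewrite_cpu",
    "seq_in_char_kps", "seq_in_char_cpu",
    "seq_in_block_kps", "seq_in_block_cpu",
    "seeks_ps", "seeks_cpu",
    "files",
    "seq_create_ps", "seq_create_cpu",
    "seq_read_ps", "seq_read_cpu",
    "seq_delete_ps", "seq_delete_cpu",
    "rand_create_ps", "rand_create_cpu",
    "rand_read_ps", "rand_read_cpu",
    "rand_delete_ps", "rand_delete_cpu",
]


def parse_bonnie(line):
    """Turn one CSV summary line into a dict keyed by the labels above."""
    values = line.strip().split(",")
    if len(values) != len(FIELDS):
        raise ValueError("expected %d fields, got %d" % (len(FIELDS), len(values)))
    record = {}
    for name, raw in zip(FIELDS, values):
        if name in ("machine", "size"):
            record[name] = raw
        elif set(raw) == {"+"}:
            record[name] = None  # not measured
        else:
            record[name] = float(raw)
    return record


p4 = parse_bonnie("anode,1504M,5677,51,25452,31,17245,19,9652,56,49905,20,292.1,5,16,4147,32,32462,22,6747,31,5210,38,+++++,+++,10965,46")
arm = parse_bonnie("debian,1G,3152,99,8606,80,6942,89,4999,99,28059,99,1490.4,96,16,562,59,+++++,+++,937,99,864,94,+++++,+++,879,100")

# Ratio P4 / SheevaPlug for a few headline numbers.
for key in ("seq_out_block_kps", "seq_in_block_kps", "seeks_ps"):
    print(key, round(p4[key] / arm[key], 2))
```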



Default SheevaPlug provides Samba (I've just started it)

user@anode:~$ smbtree
Password:
WORKGROUP
    \\DEBIAN                 debian server (Samba, Ubuntu)
        \\DEBIAN\IPC$               IPC Service (debian server (Samba, Ubuntu))
        \\DEBIAN\Media              Media


some posted info:
http://www.linux.org.ru/view-message.jsp?msgid=3672922&page=2#3676699

Experimenting with ARM-based SheevaPlug
[info]siberean
Connected the device to power and its RJ45 port to the router.
A new IP address appeared on the network, and I was able to connect over ssh using root/nosoup4u.

Measuring the time from power-on until the network is up (I kill the ping process on the first response):

$ time ping 192.168.0.104
PING 192.168.0.104 (192.168.0.104) 56(84) bytes of data.
From 192.168.0.101 icmp_seq=90 Destination Host Unreachable
From 192.168.0.101 icmp_seq=91 Destination Host Unreachable
. . .
64 bytes from 192.168.0.104: icmp_seq=92 ttl=64 time=4.33 ms
^C
--- 192.168.0.104 ping statistics ---
92 packets transmitted, 1 received, +60 errors, 98% packet loss, time 91094ms
rtt min/avg/max/mdev = 4.338/4.338/4.338/0.000 ms, pipe 3

real    1m31.997s
user    0m0.000s
sys    0m0.004s
user@anode:~$

So, the device boots in ~1.5 minutes (I measured twice; the second time it was 1m29s).
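The same measurement can be automated instead of killing ping by hand. Below is a minimal Python sketch; wait_until_up and tcp_probe are hypothetical helpers, and probing TCP port 22 is my assumption: it reports when sshd accepts connections, which can be somewhat later than the first ICMP reply measured above.

```python
import socket
import time


def tcp_probe(host, port=22, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def wait_until_up(probe, interval=1.0, limit=300.0):
    """Call probe() until it succeeds; return elapsed seconds, or None on timeout."""
    start = time.monotonic()
    while time.monotonic() - start < limit:
        if probe():
            return time.monotonic() - start
        time.sleep(interval)
    return None


# Usage (start it at the moment the device is powered on):
#   elapsed = wait_until_up(lambda: tcp_probe("192.168.0.104"))
#   print("up after", elapsed, "seconds")
```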


root@debian:/mnt/tmp# cat /proc/cpuinfo
Processor    : ARM926EJ-S rev 1 (v5l)
BogoMIPS    : 1192.75
Features    : swp half thumb fastmult edsp
CPU implementer    : 0x56
CPU architecture: 5TE
CPU variant    : 0x2
CPU part    : 0x131
CPU revision    : 1
Cache type    : write-back
Cache clean    : cp15 c7 ops
Cache lockdown    : format C
Cache format    : Harvard
I size        : 16384
I assoc        : 4
I line length    : 32
I sets        : 128
D size        : 16384
D assoc        : 4
D line length    : 32
D sets        : 128

Hardware    : Feroceon-KW
Revision    : 0000
Serial        : 0000000000000000


root@debian:/mnt/tmp# free
             total       used       free     shared    buffers     cached
Mem:        515636     100112     415524          0          0      80384
-/+ buffers/cache:      19728     495908
Swap:            0          0          0

root@debian:~# dmesg
Linux version 2.6.22.18 (dhaval@devbox) (gcc version 4.2.1) #1 Thu Mar 19 14:46:22 IST 2009
CPU: ARM926EJ-S [56251311] revision 1 (ARMv5TE), cr=00053177
Machine: Feroceon-KW
Using UBoot passing parameters structure
Memory policy: ECC disabled, Data cache writeback
On node 0 totalpages: 131072
  DMA zone: 1024 pages used for memmap
  DMA zone: 0 pages reserved
  DMA zone: 130048 pages, LIFO batch:31
  Normal zone: 0 pages used for memmap
CPU0: D VIVT write-back cache
CPU0: I cache: 16384 bytes, associativity 4, 32 byte lines, 128 sets
CPU0: D cache: 16384 bytes, associativity 4, 32 byte lines, 128 sets
Built 1 zonelists.  Total pages: 130048
Kernel command line: console=ttyS0,115200 mtdparts=nand_mtd:0x400000@0x100000(uImage),0x1fb00000@0x500000(rootfs) rw root=/dev/mtdblock1 rw ip=10.4.50.4:10.4.50.5:10.4.50.5:255.255.255.0:DB88FXX81:eth0:none
PID hash table entries: 2048 (order: 11, 8192 bytes)
Console: colour dummy device 80x30
Dentry cache hash table entries: 65536 (order: 6, 262144 bytes)
Inode-cache hash table entries: 32768 (order: 5, 131072 bytes)
Memory: 256MB 256MB 0MB 0MB = 512MB total
Memory: 515456KB available (3864K code, 257K data, 104K init)
Calibrating delay loop... 1192.75 BogoMIPS (lpj=5963776)
Mount-cache hash table entries: 512
CPU: Testing write buffer coherency: ok
NET: Registered protocol family 16

CPU Interface
-------------
SDRAM_CS0 ....base 00000000, size 256MB
SDRAM_CS1 ....base 10000000, size 256MB
SDRAM_CS2 ....disable
SDRAM_CS3 ....disable
PEX0_MEM ....base e8000000, size 128MB
PEX0_IO ....base f2000000, size   1MB
INTER_REGS ....base f1000000, size   1MB
NFLASH_CS ....base fa000000, size   2MB
SPI_CS ....base f4000000, size  16MB
BOOT_ROM_CS ....no such
DEV_BOOTCS ....no such
CRYPT_ENG ....base f0000000, size   2MB

  Marvell Development Board (LSP Version KW_LSP_4.2.7_patch2)-- SHEEVA PLUG  Soc: 88F6281 A0 LE

 Detected Tclk 200000000 and SysClk 400000000
MV Buttons Device Load
Marvell USB EHCI Host controller #0: c08b8600
PEX0 interface detected no Link.
PCI: bus0: Fast back to back transfers enabled
SCSI subsystem initialized
usbcore: registered new interface driver usbfs
usbcore: registered new interface driver hub
usbcore: registered new device driver usb
NET: Registered protocol family 2
Time: kw_clocksource clocksource has been installed.
IP route cache hash table entries: 16384 (order: 4, 65536 bytes)
TCP established hash table entries: 65536 (order: 7, 524288 bytes)
TCP bind hash table entries: 65536 (order: 6, 262144 bytes)
TCP: Hash tables configured (established 65536 bind 65536)
TCP reno registered
RTC registered
Use the XOR engines (acceleration) for enhancing the following functions:
  o RAID 5 Xor calculation
  o kernel memcpy
  o kenrel memzero
Number of XOR engines to use: 4
cesadev_init(c000c894)
mvCesaInit: sessions=640, queue=64, pSram=f0000000
Warning: TS unit is powered off.
MV Buttons Driver Load
NTFS driver 2.1.28 [Flags: R/O].
JFFS2 version 2.2. (NAND) © 2001-2006 Red Hat, Inc.
io scheduler noop registered
io scheduler anticipatory registered (default)
Serial: 8250/16550 driver $Revision: 1.90 $ 4 ports, IRQ sharing disabled
serial8250.0: ttyS0 at MMIO 0xf1012000 (irq = 33) is a 16550A
serial8250.0: ttyS1 at MMIO 0xf1012100 (irq = 34) is a 16550A
Loading Marvell Ethernet Driver:
  o Cached descriptors in DRAM
  o DRAM SW cache-coherency
  o Single RX Queue support - ETH_DEF_RXQ=0
  o Single TX Queue support - ETH_DEF_TXQ=0
  o TCP segmentation offload enabled
  o Receive checksum offload enabled
  o Transmit checksum offload enabled
  o Network Fast Processing (Routing) supported
  o Driver ERROR statistics enabled
  o Driver INFO statistics enabled
  o Proc tool API enabled
  o Rx descripors: q0=128
  o Tx descripors: q0=532
  o Loading network interface(s):
    o eth0, ifindex = 1, GbE port = 0
    o eth1, ifindex = 2, GbE port = 1

mvFpRuleDb (dfd00000): 16384 entries, 65536 bytes
Intel(R) PRO/1000 Network Driver - version 7.3.20-k2-NAPI
Copyright (c) 1999-2006 Intel Corporation.
e100: Intel(R) PRO/100 Network Driver, 3.5.17-k4-NAPI
e100: Copyright(c) 1999-2006 Intel Corporation

Warning Sata is Powered Off
NFTL driver: nftlcore.c $Revision: 1.98 $, nftlmount.c $Revision: 1.41 $
NAND device: Manufacturer ID: 0xad, Chip ID: 0xdc (Hynix NAND 512MiB 3,3V 8-bit)
Scanning device for bad blocks
Bad eraseblock 324 at 0x02880000
Bad eraseblock 332 at 0x02980000
Bad eraseblock 340 at 0x02a80000
Bad eraseblock 348 at 0x02b80000
Bad eraseblock 356 at 0x02c80000
Bad eraseblock 364 at 0x02d80000
Bad eraseblock 372 at 0x02e80000
Bad eraseblock 380 at 0x02f80000
Bad eraseblock 2372 at 0x12880000
Bad eraseblock 2380 at 0x12980000
Bad eraseblock 2388 at 0x12a80000
Bad eraseblock 2396 at 0x12b80000
Bad eraseblock 2404 at 0x12c80000
Bad eraseblock 2412 at 0x12d80000
Bad eraseblock 2420 at 0x12e80000
Bad eraseblock 2428 at 0x12f80000
Bad eraseblock 3088 at 0x18200000
Bad eraseblock 3636 at 0x1c680000
Bad eraseblock 3637 at 0x1c6a0000
Bad eraseblock 3644 at 0x1c780000
Bad eraseblock 3645 at 0x1c7a0000
Bad eraseblock 3646 at 0x1c7c0000
Bad eraseblock 3647 at 0x1c7e0000
Bad eraseblock 3648 at 0x1c800000
Bad eraseblock 3684 at 0x1cc80000
2 cmdlinepart partitions found on MTD device nand_mtd
Using command line partition definition
Creating 2 MTD partitions on "nand_mtd":
0x00100000-0x00500000 : "uImage"
0x00500000-0x20000000 : "rootfs"
ehci_marvell ehci_marvell.70059: Marvell Orion EHCI
ehci_marvell ehci_marvell.70059: new USB bus registered, assigned bus number 1
ehci_marvell ehci_marvell.70059: irq 19, io base 0xf1050100
ehci_marvell ehci_marvell.70059: USB 2.0 started, EHCI 1.00, driver 10 Dec 2004
usb usb1: configuration #1 chosen from 1 choice
hub 1-0:1.0: USB hub found
hub 1-0:1.0: 1 port detected
ohci_hcd: 2006 August 04 USB 1.1 'Open' Host Controller (OHCI) Driver
USB Universal Host Controller Interface driver v3.0
usbcore: registered new interface driver usblp
drivers/usb/class/usblp.c: v0.13: USB Printer Device Class driver
Initializing USB Mass Storage driver...
usbcore: registered new interface driver usb-storage
USB Mass Storage support registered.
mice: PS/2 mouse device common for all mice
i2c /dev entries driver
Linux telephony interface: v1.00
Marvell Telephony Driver:
mvBoardVoiceAssembleModeGet: TDM not supported(boardId=0x9)
assembly=-1,irq=-1
mp_check_config: Error, invalid voice assembley mode
md: linear personality registered for level -1
md: raid0 personality registered for level 0
md: raid1 personality registered for level 1
raid6: int32x1     97 MB/s
raid6: int32x2    114 MB/s
raid6: int32x4    122 MB/s
raid6: int32x8    110 MB/s
raid6: using algorithm int32x4 (122 MB/s)
md: raid6 personality registered for level 6
md: raid5 personality registered for level 5
md: raid4 personality registered for level 4
raid5: measuring checksumming speed
   arm4regs  :  1071.600 MB/sec
   8regs     :   754.800 MB/sec
   32regs    :   899.600 MB/sec
raid5: using function: arm4regs (1071.600 MB/sec)
device-mapper: ioctl: 4.11.0-ioctl (2006-10-12) initialised: dm-devel@redhat.com
dm_crypt using the OCF package.
sdhci: Secure Digital Host Controller Interface driver
sdhci: Copyright(c) Pierre Ossman
mvsdmmc: irq =28 start f1090000
mvsdmmc: no IRQ detect
usbcore: registered new interface driver usbhid
drivers/hid/usbhid/hid-core.c: v2.6:USB HID core driver
Advanced Linux Sound Architecture Driver Version 1.0.14 (Thu May 31 09:03:25 2007 UTC).
mvCLAudioCodecRegGet: Error while reading register!
mvCLAudioCodecInit: Error - Invalid Cirrus Logic chip/rev ID!
Error - Cannot initialize audio decoder.at address =0xff<6>ALSA device list:
  #0: Marvell mv88fx_snd ALSA driver
TCP cubic registered
NET: Registered protocol family 1
NET: Registered protocol family 17
eth0: started
IP-Config: Complete:
      device=eth0, addr=10.4.50.4, mask=255.255.255.0, gw=10.4.50.5,
     host=DB88FXX81, domain=, nis-domain=(none),
     bootserver=10.4.50.5, rootserver=10.4.50.5, rootpath=
md: Autodetecting RAID arrays.
md: autorun ...
md: ... autorun DONE.
eth0: link up, full duplex, speed 100 Mbps
eth0: link down
eth0: link up, full duplex, speed 100 Mbps
Empty flash at 0x088dd338 ends at 0x088dd800
VFS: Mounted root (jffs2 filesystem).
Freeing init memory: 104K
fat: exports duplicate symbol fat_add_entries (owned by kernel)

root@debian:~# who am i
root     pts/0        2000-01-24 21:57 (192.168.0.101)
root@debian:~#


Now trying to connect through the debug console (mini USB port) using minicom.

On the host Linux machine:
$ modprobe ftdi_sio vendor=0x9e88 product=0x9e8f

To check that the module is loaded:
$ lsmod | grep ftdi
ftdi_sio               55944  1
usbserial              39528  3 ftdi_sio
usbcore               149488  6 ftdi_sio,usbserial,ohci_hcd,ehci_hcd,uhci_hcd

As said in the manual:
$minicom -s
but with the serial device set to /dev/ttyUSB1 instead of /dev/ttyS0

$minicom -e etc


After that I got a login prompt and entered root/nosoup4u:


$ minicom -s

Welcome to minicom 2.3                                                   
                                                                         
OPTIONS: I18n                                                               
Compiled on Oct 24 2008, 06:37:44.                                          
Port /dev/ttyUSB1                                                           
                                                                            
                 Press CTRL-A Z for help on special keys                    
                                                                            
                                                                            
Ubuntu jaunty (development branch) debian ttyS0                             
                                                                            
debian login: AT S7=45 S0=0 L1 V1 X4 &c1 E1 Q0                              
Password:
Last login: Mon Jan 24 18:52:17 UTC 2000 from 192.168.0.101 on pts/0           
Linux debian 2.6.22.18 #1 Thu Mar 19 14:46:22 IST 2009 armv5tejl               
                                                                               
The programs included with the Ubuntu system are free software;                
the exact distribution terms for each program are described in the             
individual files in /usr/share/doc/*/copyright.                                
                                                                               
Ubuntu comes with ABSOLUTELY NO WARRANTY, to the extent permitted by           
applicable law.                                                                
                                                                               
To access official Ubuntu documentation, please visit:                         
http://help.ubuntu.com/                                                        
1 failure since last login.                                                    
Last was Mon Jan 24 21:32:17 2000 on ttyS0.                                    

root@debian:~# who am i                                                        
root     ttyS0        Jan 24 21:33                                             

root@debian:~# uname -a                                                        
Linux debian 2.6.22.18 #1 Thu Mar 19 14:46:22 IST 2009 armv5tejl GNU/Linux    

root@debian:~# apt-get update
E: Archive directory /var/cache/apt/archives/partial is missing.

root@debian:~# mkdir -p /var/cache/apt/archives/partial

now it is fine:

root@debian:~# apt-get update
Get:1 http://ports.ubuntu.com jaunty Release.gpg [189B]
Get:2 http://ports.ubuntu.com jaunty/main Translation-en_CA [2731B]
Get:3 http://ports.ubuntu.com jaunty/restricted Translation-en_CA [3970B]
Ign http://ports.ubuntu.com jaunty/universe Translation-en_CA        
Ign http://ports.ubuntu.com jaunty/multiverse Translation-en_CA
Get:4 http://ports.ubuntu.com jaunty Release [74.6kB] 
Get:5 http://ports.ubuntu.com jaunty/main Packages [1234kB]
Get:6 http://ports.ubuntu.com jaunty/restricted Packages [865B]               
Get:7 http://ports.ubuntu.com jaunty/universe Packages [4442kB]               
Get:8 http://ports.ubuntu.com jaunty/multiverse Packages [159kB]
Fetched 5917kB in 60s (97.9kB/s)                                             
Reading package lists... Done
root@debian:~#

Trying to NFS-mount the main Linux host:

root@debian:~# mount 192.168.0.101:/ /mnt/tmp/
mount: wrong fs type, bad option, bad superblock on 192.168.0.101:/,
       missing codepage or helper program, or other error
       (for several filesystems (e.g. nfs, cifs) you might
       need a /sbin/mount.<type> helper program)
       In some cases useful info is found in syslog - try
       dmesg | tail  or so

So the NFS client is not installed:

root@debian:~# apt-get install nfs-common portmap

root@debian:~# mount 192.168.0.101:/ /mnt/tmp/
root@debian:~# ls /mnt/tmp
bin   initrd.img      lib   mnt   root  sys  var
boot   dev     etc    initrd.img.old  lost+found  opt   sbin  tmp  vmlinuz
cdrom   home   media       proc  srv   usr  vmlinuz.old

(next time we'll do nfs boot)

A bit of NFS performance testing (/mnt/tmp is the mounted Linux host):

From the device to the Linux host:
root@debian:/mnt/tmp# dd if=/dev/zero of=/mnt/tmp/tmp/testfile bs=1024 count=100000
100000+0 records in
100000+0 records out
102400000 bytes (102 MB) copied, 9.22229 s, 11.1 MB/s

From the Linux host to the device:
root@debian:/mnt/tmp# dd if=/mnt/tmp/tmp/testfile of=/tmp/testfile bs=1024 count=100000
100000+0 records in
100000+0 records out
102400000 bytes (102 MB) copied, 44.1235 s, 2.3 MB/s
root@debian:/mnt/tmp#

which is obviously limited by the slow flash memory. Compare local to local:
root@debian:/mnt/tmp# dd if=/dev/zero of=/tmp/testfile bs=1024 count=100000
100000+0 records in
100000+0 records out
102400000 bytes (102 MB) copied, 37.7241 s, 2.7 MB/s

now from remote to remote:
root@debian:/mnt/tmp# dd if=/mnt/tmp/tmp/testfile of=/mnt/tmp/tmp/testfile2 bs=1024 count=100000
100000+0 records in
100000+0 records out
102400000 bytes (102 MB) copied, 9.46773 s, 10.8 MB/s

which is pretty good.
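
The MB/s figures printed by dd are simply bytes divided by seconds, with 1 MB = 10^6 bytes. A small Python sketch to recompute the four runs above (throughput_mb_s is a hypothetical helper name; the numbers are copied from the dd output):

```python
def throughput_mb_s(nbytes, seconds):
    """Throughput in MB/s, with 1 MB = 1,000,000 bytes as dd reports it."""
    return nbytes / seconds / 1e6


NBYTES = 1024 * 100000  # bs=1024 count=100000 -> 102400000 bytes

runs = [
    ("device -> NFS", 9.22229),
    ("NFS -> local flash", 44.1235),
    ("local -> local flash", 37.7241),
    ("NFS -> NFS", 9.46773),
]

for label, seconds in runs:
    print("%-22s %5.1f MB/s" % (label, throughput_mb_s(NBYTES, seconds)))
```

Both runs that write to local flash land in the 2-3 MB/s range, while the two that write over NFS reach about 11 MB/s, which fits the flash being the bottleneck rather than the network.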


Now let's employ the device as the rsync master host, doing incremental synchronization of data between old Pentium MMX and Pentium II machines (before this I used the Linux host, which consumes a lot of energy):
root@debian:~# apt-get install rsync
. . .
root@debian:~# scp 192.168.0.12:/ISROOT/1/backup.sh
root@debian:~# mkdir ISROOT

...there is some problem with the mapping between users (write permission)

root@debian:~# apt-get install openjdk-6-jdk

root@debian:~# apt-get install xorg
root@debian:~# apt-get install xterm
root@debian:~# apt-get install fluxbox
root@debian:~# apt-get install vnc-server

dpkg-reconfigure xserver-xorg

Answer "Yes" to the framebuffer question; leave all other settings at their defaults.

REST (as command-line of the web) vs web-services
[info]siberean
"REST isn’t really about human-to-system or system-to-system. It’s about sheer consumer scalability through the separation of operations, resource identifiers, and resources, and enforcing uniformity across both operations and resource identifiers"

"WSDL 2.0 will not solve the problem — it’s designed to describe specific operations on an HTTP method + URI pair, it’s not designed to describe a set of a few generic operations (GET/PUT/DELETE/POST) on a complex URIspace. It makes URIs second citizens, and thus is not very ‘webby’."


http://www.noahcampbell.info/2006/07/10/soa-vs-rest/

C hash vs C++ Map vs C++ Hash Map vs java HashMap
[info]siberean
This test demonstrates that spending a day programming a custom optimized hash (as opposed to blindly using the default STL map, or a hash map, which turned out to be even slower) may be worth it, especially in projects where such operations are critical: a rule engine, an expert system, an in-memory index etc. Sure, it is possible to tune the C++ code, play with the algorithm and make it C-like, but the code then becomes even less obvious and longer than the C version (so why bother at all?). C is a simpler language (its whole definition fits into the thin K&R book) and much more compact than C++, so there are far fewer artifacts and extra things to know, and so fewer bugs. Not to mention the overheads of C++.


$ g++ -O2 -o map map.cpp
$ ./map
records loaded in 0.233988
records red in 0.67486

$ gcc -O2 -o mapc map.c
$ ./mapc
records loaded in 0.177162
records red in 0.183764

$ head dict
a
A
AA
AAA
Aachen
aardvark
Aaren
Aarhus
Aarika
Aaron

$ cat dict | wc
62074 62074 547509

$ uname -a
Linux compaq 2.6.18-6-amd64 #1 SMP Tue Aug 19 04:30:56 UTC 2008 x86_64 GNU/Linux

$cat map.cpp  


#include <map>
#include <string>
#include <iostream>
#include <iomanip>
#include <fstream>
#include <sys/time.h>

using namespace std;

int main() {
    map<string, string> hash;
    string line;
    struct timeval start, middle, end;
    time_t s, us;

    gettimeofday(&start, NULL);

    // Load every word of the dictionary as both key and value.
    ifstream in("dict");
    while (in >> line) {
        hash[line] = line;
    }
    gettimeofday(&middle, NULL);
    s = middle.tv_sec - start.tv_sec;
    us = middle.tv_usec - start.tv_usec;
    if (us < 0) { s--; us += 1000000; }

    // Zero-pad the microseconds: 67486 us must print as 0.067486, not 0.67486.
    cout << "records loaded in " << s << "." << setfill('0') << setw(6) << us << "\n";

    // Look up three fixed keys 10000 times (operator[] inserts a missing key).
    for (int i = 0; i < 10000; i++) {
        hash["abracadabra"];
        hash["vasya"];
        hash["test"];
    }
    gettimeofday(&end, NULL);
    s = end.tv_sec - middle.tv_sec;
    us = end.tv_usec - middle.tv_usec;
    if (us < 0) { s--; us += 1000000; }

    cout << "records red in " << s << "." << setfill('0') << setw(6) << us << "\n";

    return 0;
}


$cat map.c

#include <string.h>
#include <stdlib.h>
#include <stdio.h>
#include <errno.h>
#include <sys/time.h>

#define MAX_MAPPING_FILE_LINE 1024
#define MAX_KEY 1024

#define OUT_OF_MEM() { \
			fprintf(stderr, "Out of memory: %s:%d", __FILE__, __LINE__); \
			exit(-1); \
		}


typedef struct hash_bucket{
    char* key;
    void* data;
    struct hash_bucket* next;
} hash_bucket;

struct hash{
    long size;
    long n;
    long used;
    hash_bucket** table;
    char name[256];
};


static unsigned long calc_hash(char* str){

    unsigned long hash = 0;
    int c;

    while ((c = *str++))
        hash = c + (hash << 6) + (hash << 16) - hash;

    return hash;
}


struct hash* hash_create(long size){

    struct hash* table;
    long i;

    /* reject invalid sizes before allocating anything */
    if (size <= 0)
        return NULL;

    table = (struct hash*)malloc(sizeof(struct hash));
    if(table == NULL)
        OUT_OF_MEM();

    table->size = size;
    table->table = (hash_bucket**)malloc(sizeof(hash_bucket*) * size);
    if(table->table == NULL)
        OUT_OF_MEM();

    /* every chain starts empty */
    for (i = 0; i < size; i++)
        table->table[i] = NULL;

    table->n = 0;
    table->used = 0;
    table->name[0] = '\0';

    return table;
}


void* hash_put(struct hash* table, char* key, void* data){

    unsigned long val = calc_hash(key) % table->size;
    hash_bucket* bucket;

    if ((table->table)[val] == NULL) {
        bucket = (hash_bucket*)malloc(sizeof(hash_bucket));
        if (bucket == NULL)
        	OUT_OF_MEM();

        bucket->key = strdup(key);
        bucket->next = NULL;
        bucket->data = data;

        (table->table)[val] = bucket;
        table->n++;
        table->used++;

        return bucket->data;
    }

    for (bucket = (table->table)[val]; bucket != NULL; bucket = bucket->next)
        if (strcmp(key, bucket->key) == 0) {
            void* old_data = bucket->data;

            bucket->data = data;

            return old_data;
        }

    bucket = (hash_bucket*) malloc(sizeof(hash_bucket));
    if (bucket == NULL)
    	OUT_OF_MEM();

    bucket->key = strdup(key);
    bucket->data = data;
    bucket->next = (table->table)[val];

    (table->table)[val] = bucket;
    table->n++;

    return data;
}

void* hash_get(struct hash* table, char* key){

    unsigned long val = calc_hash(key) % table->size;
    hash_bucket* bucket;

	/* printf("getting attempt %s<\n", key); */

    if ((table->table)[val] == NULL)
        return NULL;

    for (bucket = (table->table)[val]; bucket != NULL; bucket = bucket->next)
        if (strcmp(key, bucket->key) == 0)
            return bucket->data;

    return NULL;
}


static int read_mapping(const char *filepath, struct hash *map){

	FILE *fp = NULL;
	char line[MAX_MAPPING_FILE_LINE];
	char *delim;
	char *valStart;
	long lineNum = 0;
	long len;
	char key_buffer[MAX_KEY + 1];
	char val_buffer[MAX_MAPPING_FILE_LINE + 1];


	if ((fp = fopen(filepath, "r")) == NULL){
		printf("Cannot read %s\n", filepath);
		return -1;
	}

	errno=0;

	while(fgets(line, sizeof(line), fp) != NULL){

		if(errno){
			fprintf(stderr, "Can't read from file %s: %s\n", filepath, strerror(errno));
			exit(-1);
		}

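		lineNum++;

		/* strip the trailing newline, otherwise stored keys never match lookups */
		len = strlen(line);
		if(len > 0 && line[len-1] == '\n')
			line[--len] = '\0';

		if(len == 0){
			fprintf(stdout, "WARN: line %ld is empty -- ignored\n", lineNum);
			continue;
		}

		strcpy(key_buffer, line);
		strcpy(val_buffer, line);

		/* hash_put() copies the key itself; only the value needs strdup() */
		hash_put(map, key_buffer, strdup(val_buffer));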

	}/* fgets */

	errno=0;

	fclose(fp);

	if(errno)
		fprintf(stderr, "Can't close file descriptor %s: %s\n", filepath, strerror(errno));

	return 0;
}

int main(void){
    struct hash* map;
    struct timeval start;
    struct timeval middle;
    struct timeval end;
    long s, us;
    int i;

    gettimeofday(&start, NULL);

    map = hash_create(100000);
    read_mapping("./dict", map);

    gettimeofday(&middle, NULL);
    s = middle.tv_sec - start.tv_sec;
    us = middle.tv_usec - start.tv_usec;
    if(us<0){ s--; us+=1000000; }

    /* %06ld zero-pads the microseconds so the fraction reads correctly */
    printf("records loaded in %ld.%06ld\n", s, us);

    for(i=0; i<10000; i++) {
        hash_get(map, "abracadabra");
        hash_get(map, "vasya");
        hash_get(map, "test");
    }

    gettimeofday(&end, NULL);
    /* time the lookups only: measure from middle, not from start */
    s = end.tv_sec - middle.tv_sec;
    us = end.tv_usec - middle.tv_usec;
    if(us<0){ s--; us+=1000000; }
    printf("records red in %ld.%06ld\n", s, us);

    return 0;
}

The Berkeley algorithm (the hash known as sdbm, also used in Berkeley DB) is used in the custom hash as the most suitable one for hashing human-language words: I got 0 collisions populating the table with English words (dict) using it, and 70 collisions with the Bernstein algorithm, so I'm using the Berkeley one.


 static unsigned long calc_hash(char* str){
     unsigned long hash = 0;
     int c;
     while ((c = *str++))
         hash = c + (hash << 6) + (hash << 16) - hash;
     return hash;
 }
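
The collision comparison mentioned above could be reproduced with a small harness. This is only a sketch under assumptions: `sdbm_hash`, `djb2_hash` and `count_collisions` are names introduced here (they are not part of map.c), a "collision" is taken to mean two words producing the same full hash value, and the Bernstein function is assumed to be the common `hash * 33 + c` variant seeded with 5381.

```c
#include <assert.h>
#include <stdlib.h>

typedef unsigned long (*hash_fn)(const char *);

/* Same formula as calc_hash() above. */
unsigned long sdbm_hash(const char *str) {
    unsigned long hash = 0;
    int c;
    while ((c = *str++))
        hash = c + (hash << 6) + (hash << 16) - hash;
    return hash;
}

/* Bernstein hash; assumed variant: hash * 33 + c, seed 5381. */
unsigned long djb2_hash(const char *str) {
    unsigned long hash = 5381;
    int c;
    while ((c = *str++))
        hash = ((hash << 5) + hash) + c;
    return hash;
}

/* qsort() comparator for unsigned long values. */
static int cmp_ulong(const void *a, const void *b) {
    unsigned long x = *(const unsigned long *)a;
    unsigned long y = *(const unsigned long *)b;
    return (x > y) - (x < y);
}

/* Returns how many of the n words produce a hash value that another
   word has already produced, or -1 if memory runs out. */
long count_collisions(const char **words, size_t n, hash_fn fn) {
    unsigned long *h;
    long collisions = 0;
    size_t i;

    assert(words != NULL && fn != NULL);
    if (n == 0)
        return 0;
    h = malloc(n * sizeof *h);
    if (h == NULL)
        return -1;
    for (i = 0; i < n; i++)
        h[i] = fn(words[i]);
    /* after sorting, equal hash values sit next to each other */
    qsort(h, n, sizeof *h, cmp_ulong);
    for (i = 1; i < n; i++)
        if (h[i] == h[i - 1])
            collisions++;
    free(h);
    return collisions;
}
```

Feeding it the words of dict (read line by line, newline stripped) with each function in turn gives two counts that can be compared directly. Counting bucket collisions instead would mean taking each value modulo the table size before sorting.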

$ javac Map.java

$ java Map
records loaded in 1829ms
records red in 86ms

import java.io.*;
import java.util.*;

public class Map{
	public static void main(String[] args){
	   try{
		long t0 = System.currentTimeMillis();
		HashMap<String, String> hash = new HashMap<String, String>();
		BufferedReader is = new BufferedReader(new FileReader("./dict"));

		String line = is.readLine();
		while(line!=null){
			hash.put(line, line);
			line = is.readLine();
		} 		
		System.out.println("records loaded in " + (System.currentTimeMillis()-t0) + "ms");
		t0 = System.currentTimeMillis();
 		for(int i=0; i<10000; i++) {
        		hash.get("abracadabra");
        		hash.get("vasya");
        		hash.get("test");
    		}
		System.out.println("records red in " + (System.currentTimeMillis()-t0) + "ms");
	   }
           catch(Exception e){
		e.printStackTrace();
	   }
	} 
}

$ g++ -O2 -o hashmap hashmap.cpp
$ ./hashmap
records loaded in 0.521390
records red in 0.160082

$cat hashmap.cpp

#include <map>
#include <string>
#include <iostream>
#include <iomanip>
#include <fstream>
#include <ext/hash_map>
#include <sys/time.h>

//std::string does not have declared the following because
//hash_map is considered to be ext, so declare it here:
namespace __gnu_cxx
{
        template<> struct hash< std::string >
        {
                size_t operator()( const std::string& x ) const
                {
                        return hash< const char* >()( x.c_str() );
                }
        };
}

using namespace std;
using namespace __gnu_cxx;

int main() {
    hash_map<string, string> hash;
    string line;
    struct timeval start, middle, end;
    long s, us;

    gettimeofday(&start, NULL);

    // load every word of the dictionary as both key and value
    ifstream in("dict");
    while (in >> line) {
        hash[line] = line;
    }
    gettimeofday(&middle, NULL);
    s = middle.tv_sec - start.tv_sec;
    us = middle.tv_usec - start.tv_usec;
    if (us < 0) { s--; us += 1000000; }

    // zero-pad the microseconds so the fraction reads correctly
    cout << "records loaded in " << s << "." << setfill('0') << setw(6) << us << "\n";

    // note: operator[] inserts a missing key on its first lookup
    for (int i = 0; i < 10000; i++) {
        hash["abracadabra"];
        hash["vasya"];
        hash["test"];
    }
    gettimeofday(&end, NULL);
    s = end.tv_sec - middle.tv_sec;
    us = end.tv_usec - middle.tv_usec;
    if (us < 0) { s--; us += 1000000; }

    cout << "records red in " << s << "." << setfill('0') << setw(6) << us << "\n";

    return 0;
}


So, what surprised me is that hash_map, suggested by C++ gurus as the real hash (versus the red-black tree algorithm behind the default map), is even slower.
And both are more than 3 times slower than the C implementation (statistics of multiple runs are not shown, and the times are given only for demonstration; you can try it yourself on your machine).
Also notice that the C implementation is a brute-force, non-optimized one; it is possible to write a custom allocator rather than calling malloc every time.
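
As an illustration of the custom-allocator idea, here is a minimal sketch of a block pool that hands out fixed-size objects from large malloc'ed blocks. All names (`pool`, `pool_init`, `pool_alloc`, `pool_destroy`) are hypothetical and not part of map.c, and it assumes objects need no stricter alignment than a pointer, which holds for hash_bucket.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

typedef struct pool_block {
    struct pool_block *next;
    char *mem;              /* room for per_block objects */
} pool_block;

typedef struct pool {
    size_t obj_size;        /* rounded up to a multiple of sizeof(void *) */
    size_t per_block;       /* objects per malloc'ed block */
    size_t used;            /* objects handed out from the newest block */
    size_t nblocks;
    pool_block *blocks;     /* newest block first */
} pool;

void pool_init(pool *p, size_t obj_size, size_t per_block) {
    size_t align = sizeof(void *);
    assert(p != NULL && obj_size > 0 && per_block > 0);
    p->obj_size = (obj_size + align - 1) / align * align;
    p->per_block = per_block;
    p->used = 0;
    p->nblocks = 0;
    p->blocks = NULL;
}

/* Returns one object; calls malloc only once per per_block objects. */
void *pool_alloc(pool *p) {
    assert(p != NULL);
    if (p->blocks == NULL || p->used == p->per_block) {
        pool_block *b = malloc(sizeof *b);
        if (b == NULL)
            return NULL;
        b->mem = malloc(p->obj_size * p->per_block);
        if (b->mem == NULL) {
            free(b);
            return NULL;
        }
        b->next = p->blocks;
        p->blocks = b;
        p->nblocks++;
        p->used = 0;
    }
    return p->blocks->mem + p->obj_size * p->used++;
}

/* Frees every block at once; single objects are never freed. */
void pool_destroy(pool *p) {
    assert(p != NULL);
    while (p->blocks != NULL) {
        pool_block *next = p->blocks->next;
        free(p->blocks->mem);
        free(p->blocks);
        p->blocks = next;
    }
    p->used = 0;
    p->nblocks = 0;
}
```

In hash_put() the call `malloc(sizeof(hash_bucket))` could then become `pool_alloc(&bucket_pool)`, with `bucket_pool` initialized once by `pool_init(&bucket_pool, sizeof(hash_bucket), 4096)`. The block size of 4096 is an arbitrary choice, and whether this actually improves the load time would need measuring.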



Correlation between periods of world financial crises and significant shifts in computing.
[info]siberean

Or a chance for the next technological move towards more cost-efficient computing, the elimination of closed-loops of vendor-locked inefficient schemes, especially in ‘Big’ IT.


As it is well known – big IT clients (such as banking, government, utility companies, insurance companies, big retailers) are very hard to shift and reform. The reasons behind this inertia include: 1) there are many regulations and rules, so it is hard to push forth  innovations, as opposed to smaller IT organizations; 2) those are ‘Gig’ clients and they frequently employ 3rd-parties for their IT needs (consultants, subcontractor firms, IT 'solution providers'), frequently not having industry-leading experts in-house; 3) the existence of inherited infrastructures closely bound with 'solution providers', means that the companies the ‘Gigs’ are dependent on, are not constantly bombarded by  big competition (due to already-existing implementations and infrastructure). It is thereby likely that the next project will be implemented using existing infrastructure technologies in an iterative manner, rather than adopting something from the outside. This is because, at first sight, the cost of addition (of another feature or a new project) seems to be low (since infrastructure is already in existence), and similarly, the risk seems to be low – at least from the managerial perspective ('works – do not touch!').

 

But there are catches: the overall license costs are astronomic; they are periodic, permanent, and on the increase. The more projects bound with the technology, the harder it will be to cure/reform the system in the future (more projects will be necessary to promote change). It is a unidirectional move toward an infinitely growing budget, where dependency on the vendor (vendor-lock) increases as more and more projects are added. Another big issue is the fact that some 'Gigs' include not-for-profit organizations or the government. So, there is a clash between the interests of for-profit vendors interested only in the growing presence of permanent cash-flow from big organizations like governments, and the interests of governments to re-allocate limited resources to other urgent needs such as health-care or education (discussion of this is beyond the scope of this article).

 

 

However, there is another, more significant closed loop (for closed loops, read about the theory of recursive functions or see Hofstadter’s book*): the people’s skills loop, which is much harder to break. Nobody who has recently been considered a highly paid professional will want to start over like a university graduate, putting completed courses, diplomas and certifications aside to learn something completely different (an area in which universities frequently give a deeper education) and to collect experience in those new areas from scratch.

It is much easier (as it seems to many) to find another place where their old skills are still required, or more likely, to change absolutely nothing but rather push the manager to pay for another license term.

 

So the loop consists of the following:

1) The industry requires some skills to support the infrastructure that is already in existence (because there are always old projects on the basis of old technologies, somebody needs to maintain them);

2) Education centers (it is a free market!) provide the courses which are acquired by “Gigs”, where only 'Gigs' can be the main customers of such courses (universities are cheaper!). Then sub-contractors and consulting firms looking into the demands acquire professionals with the same skills;

3) Professionals push the solutions that they know, that they were taught and have had experience with (not those which are the most cost-efficient and/or which they do not know);

4) Managers (of ‘Gigs’ IT departments) look into _current_ industrial trends (who defines the trends?) and courses-on-offer (particularly, those they are most familiar with, having completed the same certifications). They then send personnel into the same fields (that again, they, the managers, are most familiar with).

5) IT personnel, having completed their certifications at the courses, implement the same closed solutions, thereby continuing the closed loop.   

 

It is a primary feature of closed loops that they cannot be resolved from within (Gödel's theorem), and hence a solution can come only from the outside.

 

But there comes a crisis (outside forces) which cures the system!

 

Let me explain:

 

In this article I'd like to highlight the interesting correlation between crises and significant innovations in big static IT departments: periods of crisis and innovation alternate.

By crises I mean both the oil and commodity price drops of recent decades and the financial crises. It will also be discussed how the IT industry was able to break vendor-locks in the past and find more (cost) efficient schemes of computing, i.e. to cure itself. Solutions come from outside of those closed loops, and the ill industry (bound to some old vendor-locked solution) accepts those innovations, thereby reforming itself.

 

We will not touch on home computing (Amiga, Atari, Mac, PC and mobiles burst of the recent years) – since in this article the focus is on the ‘Gig’ IT industry. Although workstation costs, especially licensing, may be very significant for big organizations, the assumption is that workstation or client machines are a part of the whole infrastructure and are usually closely bound with the technologies used on the central (server) machines. We are more interested in server machines, OLTP machines, central batch processing units, database servers –  machines serving big numbers of clients (clients may be both external, such as public Internet users, or internal, such as Intranet workstations users). Those central machines are usually the most expensive ones (mainframes or server farms, clusters, multi-way servers, big file and database storages) due to expensive top-hardware as well as high license costs (usually per-processor).

 

Although commercial computers appeared in the 50s – let’s begin in 1975 since before that there was a pure mainframe era, where one vendor dominated the market.

 

Let world financial crises (drops in oil prices) be marked by pipes on the timeline below. As can be seen, the periods between crises, and between significant shifts toward less expensive technological innovations, span an average of 7 years (2000 is an exception: the Y2K bug fix falls there, which is a very special case for the industry).

 

 

--75----------80----------85----------90----------95----------2000----------2005---------2010--

    |                       |                                |                       |                 |                       |

 

The single-digit numbers in the next chart represent eras or computing epochs (will be explained later), where the edges between epochs (pipes) can be characterized by significant shifts in technology/thinking towards more competition. Notice again that in this article we are interested only in ‘Big’ computing and ‘Big’ computers while by the shifts we mean innovations, ‘revolutions’ in the minds which brought some kind of decentralization in each case, introducing a less costly architectural solution for the problem.

It is interesting how those ‘significant shifts’ occur exactly in the middle of crises, whatever the direction of cause: either a cheaper solution is a consequence of the corresponding crisis because of industrial demand (demand for a cheaper system), or vice versa: each crisis simply releases more talented professionals into the world, and those talents (the number of which is always limited) have more time for free creativity, to make a better system. And then the system (‘Gigs’: banks, government IT, etc.) inevitably accepts those ‘solutions’, born outside, as the more efficient ones.

Notice that a totally new system/solution cannot be created 'inside the system', it must be 'above' the system, coming from the outside (which has a deep mathematical background, beyond the scope of this paper).   

 

--75----------80----------85----------90----------95----------2000----------2005---------2010--

1|                2                  |               3                 |        4        |         5          |          6          ?

                      

           

1)      Mainframe era:

*    all systems are proprietary, expensive, centralized

*    one vendor in the market

*    only the wealthiest organizations can afford such systems because of the cost: millions of dollars

2)      Minicomputers era (3rd generation computers like DEC PDP-11/50):

*    first TSS (time-sharing systems) appear

*    no payment for central computer time

This is a paradigm shift from #1, and there is no longer a dependency on one vendor: it is a move to decentralization.

3)      Era of commercial Unixes:

*    much more 'standard' and simple system than #2

This is a revolutionary shift compared with the old custom systems, where no standards existed: more vendors can implement a well-known and, more significantly, simpler OS design standard.

*    there is competition between vendors (HP Unix, IBM AIX, Sun Solaris), and a move to less expensive boxes

*    standardization of DBMS: SQL standards. Oracle, supporting the standard begins dominating (as opposed to hierarchical or other databases)

*    networks boost encourages more decentralization. Databases can now be installed on different Unix systems and competition is rampant.

4)      Windows (personal home computer) arrives into enterprise as an even cheaper solution than #3 (less than tens of thousands of dollars per server rather than hundreds of thousands of dollars per Unix servers):

*    PC hardware is seen as ‘almost free’ hardware. Although there are catches (which are ignored as solvable by themselves within a few years): in particular, robustness and scalability. The main point: the software in this period is still far from being free.

*    But programming becomes cheaper and cheaper (C++, VB, Delphi, Java, HTML/JavaScript rather than the C and assembly of era 3). More programmers can now do it

*    The Web takes off (it was actually invented earlier, in 91, in era 3, but only now becomes practically useful thanks to faster, cheap networks), and everything is moved onto the web: both thick and thin clients (the latter are reincarnations of the ‘dumb’ terminals from eras 1 and 2, but now running on $1k commodity hardware). Windows dominates the client, and NT also comes to the server.

*    Cheaper MS SQL server competes with Oracle.

 

5)      Java becomes the main application logic language, pushing out C++ and Y2K contributes to java dominance as the language of business logic (Cobol substitute in many projects).

*     Appearance of the first good free environments (for java for example).

*    Open-source arrival into enterprise.

*    Google proves that even cluster hardware can be cheap commodity hardware running free software.

6)      After Dot-Com crash.

*    Linux becomes scalable enough for the enterprise (after the appearance of the 2.6 Linux kernel) and starts to substitute commercial Unixes inherited from 3. Now the whole  OS on the server can be free.

*    Commodity processors (AMD64) become 64-bit. Now a free 64-bit OS can be installed on very cheap 64-bit processors. For example, clusters (and most of the Top500 computers) are built exclusively from commodity processors (such as AMD64).

*    Free databases become scalable enough for the enterprise (MySQL, PostgreSQL)

*    Open-Source arrival to the enterprise

*    Web2 (O’Reilly terminology): building of rich applications on the web. Competition in browsers area, standardization of JavaScript, DOM and CSS which avoids necessity in non-web rich clients

*    Outsourcing of the software development and hosting (although this is not always appropriate)

 

So, this epoch can be mainly characterized by the arrival of free, high-quality, scalable enterprise software (most of epoch 5 software was non-free). This process is still continuing: more and more big companies and governments are moving towards open-source software running on a free OS, development under free IDEs, the use of free frameworks and completely free sites (for example Drupal and hundreds of PHP and Java frameworks).

 

 

Now, in 2008, we have reached the next crisis. How can we make IT more cost-efficient now? What remains expensive in IT infrastructures today? Where do vendor-locks continue to exist?

 

 

The RDBMSs used in big installations are still mostly vendor-locked and non-free. Free databases (such as MySQL and PostgreSQL) are successfully used by small companies for very large installations, scaling well to a very large number of clients, but are still rarely used in government environments (although more and more installations have already proven their efficiency). As many people see it, there will be a big shift in this area (I put a question mark in the chart, estimating the point when the period will end). I believe (provided that progress goes on, and there is no reason to think otherwise) that probably all databases will have shifted to free ones by 2012 and most DBAs will be familiar with them, in the same way that all epoch-3 (thick client) programmers became familiar with the web. So, licensing costs on the server side are still a big issue for big organizations.
 

Application servers and development environments are frequently non-free (due to the existence of EJB architectures) but this is quickly changing, moving to the lighter ORM solutions (such as hibernate) or direct JDBC database connections (interfacing through SQL in a PHP manner) where there are no over-engineered additional client-server layers with all the added and unnecessary caching/scalability complexities. 

The simplification of server infrastructures (removal of complex custom extra tier and scaling by means of database clustering and fail-over) and the making of the Open Source database as the standard can be compared with the arrival of Unixes during the Minicomputers 2 era (as the new simpler standard).

What else is still vendor-locked? The client (mostly XP), inherited from era 4 (license costs multiplied by the number of machines), while an Ubuntu or Debian (or any other) Linux distribution with OpenOffice and thousands of free tools can cover 90% of user tasks on the client desktop. Sometimes even a light-client remote office will work (similar to what Google is trying to promote), either as a client through the old familiar X, or as a web client to a Google-Docs-like application installed on the Intranet server for familiar, secure document editing. Old robust NFS, maybe together with SMB, and all the familiar Unix tools, but now free.

Making big infrastructures free of vendor-lock, as far as the software is concerned, is the only way to break another closed loop and to make IT departments eat less of their organizations' money during hard times.

And we have shown that the whole history of computing is a permanent move to less and less vendor-locked systems, to cheaper and lighter (simpler) solutions. There is no reason for this process to stop. Crises are just catalysts of this process and bring cleverer and simpler solutions from outside of the systems. Even such change-resistant organizations as government IT departments have to accept those changes for their own survival and benefit. The above-mentioned possible moves (databases, desktop client) toward free solutions are only guesses, but analogies with the movements of past eras make me think that those assumptions are right; also, this is the only right way to go (the obvious, progressive way): to a less vendor-locked world.

 

Nov 23, 2008

Toronto

 

 

*) Douglas R. Hofstadter, “Gödel, Escher, Bach: An Eternal Golden Braid”, 1979, 1999.

 


Scalable and easy maintainable archiving system for lifetime archives from old hardware
[info]siberean

 



Old machines, such as a Pentium MMX, Pentium 2, etc. (not to mention the first Pentium 4s, which some folks moved to the outdated category long ago), can still do some very real work today.
Unfortunately, I can't employ my first 8088-2, 386DX-66 and 486-100, because they were lost during moves (and also during student years, when you often sell your old PC to get money for the next, more powerful machine ;).
There is also no sense in searching for the old 1.44M diskettes with Slackware 2 and kernel 1.0 (as would be required with alternative OSes), because even modern Linux distributions work perfectly on such hardware. There are specialized light distributions (with light X window managers), but even full-featured distributions will work (although on Pentium 1-2 class machines without Gnome/KDE, or even without X; and actually X is not necessary for the tasks described below).
For example, the screenshot shows Gentoo 1.4 (2004); another machine runs Gentoo 2006 or close; and Debian runs on a more powerful machine, from which the screenshot was made and where X Windows (Gnome) is installed.

How can such machines be used efficiently?
Everybody might think of using one as a router or firewall (with custom-made firewall rules for additional filtering and logging, in cases where small consumer appliances such as D-Link, Linksys, etc. do not give the desired flexibility/functionality), as a shared file server, a DNS server, etc.
However, in this article I'll describe the archiving solution which I have used for many years, where several such machines work in concert as almost zero-cost backup servers.

Another thing worth noticing here is that, despite its age, such old hardware only seems to be non-robust. It depends. For example, a Pentium MMX similar to the one shown on the screenshot worked for another 5 years (after its desktop life) as a firewall and DNS server, and even Apache ran there with a few modules, including jserv (with only 64M RAM on that machine), switched on 24/7. And it is still alive (3 years as a desktop, 5+ years as a server).
And my Pentium3-450 is already powerful enough to watch video, and I still use it sometimes for Internet browsing, or even for work. At the same time, today's hardware components may die after just a few years of work. So, I do not agree with those who say that maintaining such machines is a permanent headache just because they have worked out their resource/life. Even if we take the most fragile part, the spinning disk: I had a 200G disk die while an 8.5G disk is still working as a boot disk on one of the old machines. And you should agree that it is good when hardware works until the very end of its resource lifetime, and sometimes (I would say very commonly) that lifetime is amazingly long.

So, back to the main purpose of the article: usage as archive servers.
Archive servers are a special case where you need redundancy of the data and, even more, distributed redundancy (so a simple NAS or even RAID will not be sufficient and will not solve the main task: to be sure that a local fire or a theft of the computer will not destroy all your home archives, photos, etc., the resources which cannot be reproduced or downloaded from the Internet again). At the same time, such archive mirrors must always be updatable, because we change and grow every day: we re-evaluate our values, reorganize archives, add to them, make new updated copies (for example, adding smaller versions of photos), etc. In extreme cases we want to rename the root directories (update the ontology hierarchy ;) on the main, master file server. In everyday cases we add newly arrived emails, articles, copied-and-pasted chunks of information, update resumes and documents, and add other pieces of information. Those changes should be incrementally propagated to the mirrors (secondary archives/backups) without taking a lot of our valuable lifetime. We do not want to turn into sysadmins and spend evenings and weekends administering backup systems ;) so we need an automatic process.
Sometimes I also make whole-system backups (of the kids' or wife's computer, of the boot disk of the archive computer, or of another Linux machine), so when it is necessary, or when an old 8-10G hard disk on an archive server shows the first signs of aging, I'll be able to quickly replace the disk and restore a compatible system from the corresponding tarball, modifying only a few configuration files rather than wasting time on re-installation.

About terminology here: the main archive server is what all workstations see (through Samba or NFS) and where we make daily or weekly backups to. If a document or photo is needed and it has already been moved off the camera, the archive server is booted. After a significant portion of work is done, the archive server is booted as well, almost every day (or at least weekly), for a few minutes at least; it does not run permanently.
So, the archive computer is the one which must be visible from all computers (workstations) on the network; it is the level 1 backup, or master backup, where all archive modifications are made. It is beneficial to run both NFS and SMB on it, to make it visible to all workstations including Windows ones.

Once a week (or more frequently, depending on how much information you are prepared to lose if a disk on the archive server fails), changes on the archive server are synchronized (pushed) to the secondary archive server. This is an incremental update and is fast, so the task does not take a lot of time. rsync is the perfect utility for one-way synchronization, and we do not need more, as will be shown shortly.
One thing to be careful with is rsync's "--delete" option, which is necessary when you want deletions in your main working archive copy (master) to be propagated to the slave mirror. If a source disk suddenly crashes or the root directory is removed accidentally, you do not want to propagate such changes to the secondary archives. I always have "--delete" commented out, automatically propagating only additions. Rarely (or when I want to delete something explicitly) I invoke rsync with "--delete", watching what exactly is being deleted. So, do not use "--delete" in automatic scripts; use it manually, under supervision.
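
The "--delete" rule above can be expressed as a tiny launcher helper, written in C like the other code in this journal. It is only a sketch: `build_rsync_argv` and its parameters are hypothetical names, the paths in the usage note are placeholders, and the only rsync options used are the standard `-a` (archive mode), `-v` (verbose) and `--delete`.

```c
#include <assert.h>
#include <stddef.h>

#define RSYNC_MAX_ARGS 8

/* Fills argv (room for RSYNC_MAX_ARGS pointers) with an rsync command
   line and returns the number of arguments, not counting the final NULL.
   "--delete" is added only when the run is interactive, i.e. somebody is
   watching what gets deleted; automatic runs only propagate additions. */
int build_rsync_argv(const char *src, const char *dst,
                     int interactive, int allow_delete,
                     const char *argv[RSYNC_MAX_ARGS]) {
    int n = 0;

    assert(src != NULL && dst != NULL && argv != NULL);
    argv[n++] = "rsync";
    argv[n++] = "-a";   /* recurse, keep permissions and timestamps */
    argv[n++] = "-v";   /* list the files being transferred or deleted */
    if (allow_delete && interactive)
        argv[n++] = "--delete";
    argv[n++] = src;
    argv[n++] = dst;
    argv[n] = NULL;
    return n;
}
```

A weekly script would call it with `interactive` set to 0, for example `build_rsync_argv("/archive/", "/mnt/slave1/archive/", 0, 1, argv)`, and then hand `argv` to `execvp()` from `<unistd.h>`; even though deletion was requested, the automatic run is built without "--delete".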

One could use a 3rd level of archive (a 3rd copy), propagating changes from the secondary (or from the first) archive server to another copy, say, yearly, and put that yearly archive snapshot into a safe place in a remote location (in case of fire, disasters, etc.), such as a bank safe. This will guarantee that you have a copy of your generation's archives (or soon, of more than one generation) somewhere else, not in one location.
Hopefully, it is clearly seen that RAID solutions do not solve this problem: they protect against local disk failure, but do not provide distributed data redundancy. And looking from the other side: if we already have distributed data redundancy, why do we need expensive RAID solutions at all? :)

Now a bit about what and how we copy to the master archive. It depends on preferences, on the significance of the daily/weekly data, and on the form in which it arrives (a version control working copy, files, mails, data chunks, only significant emails copied into text files, or large hierarchies).
For me, for example, it is very convenient not to think each time about where to save (because this data will be accessed rarely) but to do it fast, so I save files in their original form, AS IS, using a directory as the unit of saving when there is more than one file. If it is a chunk of data (a copied/pasted piece of code, for example), it is saved in a file.
The name of the directory in the first case, and of the file in the second, is not very significant: you will have a huge number of files in the archive anyway and the name will not help much, so you need an index or a database over your archives in any case. Proper maintenance of indexes is a big topic by itself (I maintain the index in multiple ways, including a text index and scripting) and is out of the scope of this article.
So directory names and filenames may be just today's date (in the form 2008-11-21 or similar) or follow some other scheme. Some folks even use MD5 or another hash as the name, the same way some version control and content management systems do, as an additional data integrity check, but I do not see a big benefit compared with the overcomplication of the whole process. In the end we use the same filesystem at the backend and filenames are not an issue: they are the original filenames, and storing the original filenames directly is a big benefit by itself, while checksums can be used independently, without hiding the original filenames.
Another thing worth noting here is that hard-disk space is cheap today, and it is by no means worth spending valuable minutes of your lifetime on maintaining archives, so the most frequent operation (daily copying) must be as fast (and as simple) as possible: just copying (one command), AS IS, instantaneously.
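
The daily "save AS IS under today's date" step, with checksums kept separately from the filenames, could be sketched roughly as follows. The ARCHIVE_ROOT location, the function names and the MD5SUMS file name are assumptions for illustration, and md5sum is assumed to be the GNU coreutils tool.

```shell
#!/bin/sh
# Hypothetical sketch: save a directory AS IS under today's date.
# ARCHIVE_ROOT is an assumed location, not taken from the article.
ARCHIVE_ROOT="${ARCHIVE_ROOT:-/archive/1}"

save_as_is() {
    # $1 = directory to save; the original filenames are kept untouched.
    target="$ARCHIVE_ROOT/$(date +%Y-%m-%d)"
    mkdir -p "$target" && cp -a "$1" "$target/"
}

write_checksums() {
    # Optional, independent integrity data: an MD5 list stored next to the files.
    # $1 = saved directory; the list itself is excluded from the listing.
    ( cd "$1" && find . -type f ! -name MD5SUMS -exec md5sum {} + > MD5SUMS )
}
```

Later, running md5sum -c MD5SUMS inside the saved directory verifies the files without the names ever having been replaced by hashes.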
The second most frequent operation is the weekly backup of the archive server, and it should also be automated: switch on the 2 machines, run the script (or have it run automatically on boot), wait a few minutes (so the backup should be incremental) and shut down the machines. If deletions have to be pushed to the secondary archives, there is no way around it: you should supervise the process (as mentioned before) and spend those minutes checking that all disks are OK and that the deleted stuff is the right stuff. The yearly process is not very different from the weekly one, except that the amount of information is bigger and big reorganizations of the data have likely been performed, so the synchronization simply takes longer.

As I have found, old Pentium-class machines are perfect as 'slave' machines for such tasks. Why 'slaves'? After the hierarchies on the disks became big, I noticed that 64M of RAM had become too small for rsync, since rsync compares the trees in memory, so I started to use a 'master' machine to invoke the scripts and to mount the other (Pentium1-class) machines through NFS. It seems that one could theoretically have an unlimited number of such machines, and a simple shell-script update is the only modification needed to scale up to unlimited volumes of data. So the Debian machine shown (a more powerful machine with X) is the 'master', and the above-mentioned archive servers (1st, 2nd, 3rd levels) are 'slaves' in my terminology, as in computing clusters. Another similarity with clusters is that the master distributes the tasks to the slaves, communicating through NFS, and that the number of slaves is not limited (and so neither is the volume of the archive storage).
Another benefit of the 'master' is that it is easy to maintain the scripts in one place (and of course, do not forget to back up the 'master' system into the archive too).
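
As a rough sketch, the weekly run on the 'master' could be a loop over the numbered disk directories of two NFS mounts. The function name weekly_backup and the mount points are made up for this example; it assumes the slaves' disks are already mounted on the master.

```shell
#!/bin/sh
# Hypothetical sketch of the weekly run on the 'master'.
# Assumes the slaves are already NFS-mounted, e.g. under /mnt/arch1 and
# /mnt/arch2 (both paths are invented for this example).
RSYNC="${RSYNC:-rsync}"

weekly_backup() {
    # $1 = mount point of the 1st-level archive, $2 = mount point of its mirror
    for disk in "$1"/*/; do
        [ -d "$disk" ] || continue      # skip when nothing matches the glob
        name=$(basename "$disk")        # numbered disk directory: 1, 2, 3 ...
        # Incremental, additions only: no --delete in the automatic run.
        $RSYNC -a "$1/$name/" "$2/$name/" || return 1
    done
}
```

Adding one more slave then means one more mount and one more call such as weekly_backup /mnt/arch1 /mnt/arch3, which is the kind of small script update mentioned above.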

Why is NFS used? I tested samba and smbfs at some point and found that NFS utilized the network bandwidth better than smb, so the more efficient, faster protocol was chosen for the backup cycle, to wait fewer minutes weekly (of course, there is no sense in running the process through ssh on processors such as the Pentium 1).
The only expense for the archiving solution (not counting the disks, which are necessary anyway and which were purchased in different years, only when required by growing data) was the IDE controllers, bought on ebay for less than $30. I compiled a kernel with support for the 3 main controller types once, and I reuse those kernels/systems. At some point I had a few problems with Sil controllers and never with Adaptec, so I prefer the latter, or the Promise ones, but this is a personal preference.

I thought that, using today's 1.5T disks, one could place even 6T of storage in one Pentium1 machine (garbage-class for most users of alternative systems), and the only difference from modern machines you will notice is the speed of copying from machine to machine, which will not actually differ by an order of magnitude, or even several-fold (if updates are incremental), unlike the difference in CPU power. This probably depends on the data and on multiple factors. I do not need such enterprise storage, so I use old disks of different capacities, to utilize them until they die their own natural deaths. The policy is the following.
Disks are named 1, 2, 3... and if a new machine is added, the count continues. This is inherited from the initial partitioning of the archives into physical disks (back in the mid-90s). Then, when after years a disk dies naturally or it makes no sense to maintain, say, a 100M disk, those numbers are still preserved and move to the new disk as directories (all archive disks have one physical partition, except the ones also used for booting). If a bigger disk takes the place of multiple smaller ones, it gets the corresponding directories 1, 2, 3... depending on how many old disks are copied to it.
Now let's imagine that one disk dies on the main (source) archive server (the file server which you use for the weekly backup and for reorganizing the data). You take the disk from the second mirror and put it into the master. Then you buy a bigger disk (because you probably want the one with the cheapest cost per gigabyte today) and, depending on its capacity, create directories on it for mirroring other (smaller) disks of the master. You 'rsync' from the 'master' now, and since the disk is bigger, you can free some other small disks from the slave (for example, I recently freed a 200G disk, which is now waiting for one of the next smallest ones, a 10-year-old 8G or 10G disk, to die). And so on. With such a substitution policy all disks work for their whole lifetime, while the copying (of both the data and the system, when a boot disk dies) does not take a lot of time, so maintenance is the minimum possible. And no data is lost. 'rsync' does the main job (and tar, when a boot disk has to be recovered).
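
The step of moving several old numbered disks onto one bigger disk, each into a directory carrying its old number, might be sketched like this. The function name adopt_disks and the mount points in the comments are hypothetical.

```shell
#!/bin/sh
# Hypothetical sketch: a new, bigger disk takes over several old numbered disks.
RSYNC="${RSYNC:-rsync}"

adopt_disks() {
    # $1 = mount point of the new disk; remaining arguments = mount points of
    # the old disks, whose last path component is the disk number,
    # e.g. adopt_disks /mnt/slave/new /mnt/master/7 /mnt/master/8
    new="$1"; shift
    for old in "$@"; do
        num=$(basename "$old")          # keep the historical disk number
        mkdir -p "$new/$num" || return 1
        $RSYNC -a "$old/" "$new/$num/" || return 1
    done
}
```

Because the numbers survive as directory names, the scripts that walk 1, 2, 3... keep working no matter which physical disk currently holds each number.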

The archiving process described above makes it possible to utilize old hardware completely (which seems to be the most cost-efficient way), without any additional spending (including time).


Wildlife in the middle of the day in Toronto
[info]siberean


and what was left after









Link to the gallery: pics.livejournal.com/siberean/gallery/0000a9yy
 




Opossum in Toronto in the middle of the day
[info]siberean







