Решение задач на условные вероятности для школьников
[info]siberean
...или как школьник может правильно оценивать вероятности, имея дополнительную информацию, но - без применения формулы Байеса.

Теорема Байеса в канонической форме:
    P(A|B) = P(B|A)*P(A)/P(B)
или
    P(A|B) = P(B|A)*P(A)/(P(B|A)*P(A)+P(B|~A)*P(~A))
хорошо запоминается, но не всегда правильно интерпретируется - в применении к конкретным задачам.
Я, например, помню, как будучи студентом, испытывал затруднение с идентификацией всех prior, posterior, в зависимости от по-разному сформулированных задач и особенно со сложными условными вероятностями. Таблицы истинности помогают, но особенно мне помогли диаграммы Вейна.

Здесь же я хочу показать использование древесной диаграммы - как она может помочь в решинии подобных задач.

Разберём следующую классическую задачу (с близкими к реальным числам):

Известно, что 1% населения заболевает раком. Тест детектирует заболевание в 80% случаев. В 9.6% случаев заболевание детектируется неправильно, т.е. у тех, у кого его реально нет (false positive). Определить - какова вероятность реального заболевания - если у конкретного человека детектирован рак.

С подобными задачами приходится сталкиваться на каждом шагу: от физического эксперимента до ситуаций в быту, в тех случаях - когда существует погрешность прибора, т.е. повсюду.

Можно заполнить таблицы истинности, нарисовать диаграммы Вейна, подставить числа напрямую в формулу Байеса. Но, как сказано выше - прежде всего числа нужно правильно интерпретировать.


Я предлагаю следующий формализм для решения задачи. Рассуждаем так.

1) Что произошло вначале, перед другими событиями задачи, перед экспериментом? Был известен факт: 1% населения подвержено заболеванию. Это существовало до теста и получено независимым (от теста) путём (наша аксиома).

Нарисуем ветвение (ясно, что если 1% "ушёл" по одной ветви, то его "дополнение" до единичной вероятности, 99%, - окажется на второй):

        |----1%----> TRUE
--------|
        |----99%---> FALSE

Кстати, в диаграмме Вейна это будет маленький кружок (1% площади 100%-ного мира) в пространстве мировых событий.
Словами TRUE, FALSE мы обозначили включение в категорию больных, а также нахождение вне этой категории, соответственно.

2) Что произошло далее? Был произведён эксперимент. И мы знаем два числа, характеризующих точность. Чуть позже мы идентифицируем - что значит каждое из чисел. Пока сильнее "разветвим" дерево (как в программировании), чтобы получить все комбинации: для TRUE мы имеем 2 состояния прибора: (positive, negative), и для FALSE - то же самое. Ведь тестируются все - как больные так и здоровые, и назначение прибора узнать - кто реально больной а кто - нет. Чисто формально:

                                 |-------->positive
        |----1%----> TRUE|
        |                        |-------->negative
        |
--------|
        |
        |                           |------->positive
        |----99%---> FALSE|
                                    |------->negative

3) Теперь вернёмся к условиям задачи и паре чисел, которые мы не использовали.
Фраза "тест детектирует заболевание в 80% случаев" означает "positive" под TRUE (тест позитивный при наличии заболевания), так что можем вписать 80% в самую верхнюю ветку.
Фраза "В 9.6% случаев заболевание детектируется неправильно" означает что болезни нет (нижняя половина дерева) но тест "positive".

Опять чисто формально заполнили дерево числами (все числа "разбросаны", ничего не осталось, хороший знак, значит можно переходить к интерпретации). Кстати, мы могли бы записать дополнения (до сотен процентов): 20% и 90.4% в незаполненных ветвях, но зачем? Они нам могут и не понадобится (если надо - всегда допишем!).

                                 |----80%---->positive
        |----1%----> TRUE|
        |                        |-------->negative
        |
--------|
        |
        |                           |----9.6%--->positive
        |----99%---> FALSE|
                                    |------->negative

Ветви можно обозначать полными путями:
/TRUE/positive
/TRUE/negative
/FALSE/positive
/FALSE/negative


Или использовав "белый" разделитель и убрав "капитализацию" (как принято в человеческих языках, или как используют те же физики) - получим:
true positive, true negative, false positive, false negative.
Как легко запомнить терминологию физиков-экспериментаторов! (достаточно просто "пройтись" по путям дерева.

(кстати, на диаграмме Вейна выше: true positive - красный, true negative - синий, false positive - зелёный, false negative - жёлтый, то-есть зелёный "заметает" или пересекает 80% малого кружочка, но имеет однако большую часть вне его, целых 9.6% всего "мира")

4) Так что нам надо было узнать в задаче? (все числа мы уже разместили).
"Определить - какова вероятность реального заболевания - если у конкретного человека детектирован рак".
Иными словами, зная, что детектирован рак (заметим, что это "positive" в наших обозначенияхи встречается это событие на 2х верхних подветвях, т.е. другие ветви можно "отмести", ведь мы уже знаем точно - что тест позитивный!) - надо определить вероятность реального заболевания.
А реальным событием является самая верхняя подветвь.
Итак, надо найти:
<Вероятность реального заболевания|positive> = <TRUE positive>/<ALL positive>
где <ALL positive> - это все "positive":
<ALL positive> = <TRUE positive>+<FALSE positive>

Чтобы найти вероятности на ветвях - надо перемножить вероятности на всём их пути, т.е.:

<TRUE positive> = 1% * 80% = 0.01*0.8 = 0.008
<FALSE positive> = 99% * 9.6% = 0.99* 0.096 = 0.09504

Окончательно
(лишь сейчас введя обозначение условной вероятности в левой части, заметим как мы лихо обошлись без них!):

<Вероятность реального заболевания|positive> = 0.008/(0.008 + 0.09504) = 0.0776397516 = ~0.078 = ~7.8%

Т.е. получили результат - что только с вероятностью 7.8% человек реально болен, при условии что тест дал положительный результат.
Результат такой же - как и в случае применения формулы Байеса, но как мне кажется - с меньшей вероятностью ошибиться.


Помещу ещё раз все вычисления на одной картинке: это гораздо короче и быстрее чем в традиционном курсе "Теор.Вероятности и Мат.Статистики"!
(кликните на картинку для увеличения и сорри за "кривизну" рисунка, извините - я не художник)


  • Add to Memories

Compact text format for genealogies storing and exchange
[info]siberean
The following new proposed format allows sending genealogy trees to relatives, potential relatives, researches or just to store and organize genealogical information in a compact form (for notes).
This format is for the ascending tree only and shows all direct ancestors of the root person (it is not supposed to work for more general graphs for descending genealogies). Each person has it's own unique binary tree.

If you have a GEDCOM file - you can get your own atree using "aisgedcom" program which is available here. This is the simplest way to get your own atree (and most of genealogists usually have their genealogy in GEDCOM format, been exported from some program (notice that it should be plain text GEDCOM, not XML format, version 5.5 or 5.5.1 or earlier since XML is not supported yet).

Each line represents a person. It contains an ID (unique path) as the very first field (which is formed as a sequence of letters corresponding to genders of ancestors), so the ID is always unique relatively to the researcher (the person which is in the root of the hierarchy. The root person is in the first line, this is not required since the lines are independent, but desired - to identify the tree. If a descendant (a son or a daughter) of the genealogy owner will continue the research - they will just need to append their gender at the front of each line and optionally merge with the second parent tree. Also, if 2 trees are to be merged - then 2 root persons will only have to find the path between them and append the corresponding prefixes. During the merge - they may want to resolve conflicts (the final merged line with every ID on the merged final tree should have the most current understanding of the name writing and dates). This is similar to resolving conflicts during merge of two versions of any text, document or programming code.

An example of the binary tree of ancestors - in case Female (F) is in the root, in 2-dimentional printing:

[F=1]-------[FF=3]---------[FFF=7]------------[FFFF=15]-------------[FFFFF=31]
   \         \                 \-----------[FFFМ=14]
    \         \-------[FFМ=6]-----------[FFМF=13]
     \                   \-----------[FFМM=12]
      \-[FМ=2]------[FMF=5]-------[FMFF=11]
          \            \-----[FMFM=10]---------[FMFМF=21]
           \                    \----------[FMFMМ=20]
            \--[FММ=4]----[FMМF=9]-----[FMМFF=19]
                 \           \-----[FMМFM=18]
                  \--[FМММ=8]----[FМММF=17]
                        \------[FММММ=16]


The problem of 2-dimentional trees representation is that they require scrolling with depth of just few centures and so do not show the big picture and are good - only when the depth is small. They are not scalable (when the tree becomes large - the tree cannot be fit on the screen). Also it is impossible or very hard to cut subtrees from it, merge trees etc. Our format (let's name it atree) is based on unique paths from the root (like paths in computer filesystems) and is much more scalable: the path size grows as logarithm of the total tree size, is very compact and lines are always self-sufficient, they do not require any relative context. You can send just one line or a subset of lines to a relative to whom it may concern. Also the format is very convenient for haplogroups assigning. The overhead is small and is the same as in binary system in comparison with the decimal system. One page can fit a lot of information about ancestors and subtrees can be easily cut into separate pages.

A real ATree example (my updated tree in Russian is here):

M [N1c1] ... <== myself
MM [N1c1] ... /Gavrilov/ ... <== father
MF [H10a1] ... /Gavrilov (Gusev)/ <== mother
MFM ... /Gusev/ <== name is not shown for privacy
MFF [H10a1] ... /Gusev (Marinov)/ <== name is not shown for privacy
MFMF Evdokiya Nikolayevna /Gusev (?)/ (?), Milna, daughter of merchant
MFMFM Nikolay /?/ <== merchant, owned mill(s)?
MFMFF ?
MFMM Nikolay Gerasimovich /Gusev/, Vladimirskaya obl.
MFMMM ?
MFMMF ?
MFFM Iohann (Ivan) Dorimedontovich /Marinov/ (26 AUG 1874-21 AUG 1943) <== relation with his father (MFFMM) has to be clarified, was adopted, however by other sources - the son from the 1st marriage
MFFF [H10a1] Nadejda Kirillovna-Kirova /Marinova (Eryomina)/ (17 SEP 1872/SEP 1954), Vladimirskaya obl.
MFFFM Kirill /Eryomin?/
MFFFF [H10a1] /? (?)/
MFFMM Dorimedont Pahomovih /Mihailov/ (ABT 1839-02 AUG 1885) <== merchant, owned  department stores in Melenki (Vladimirskaya obl)
MFFMMM Pahom Agapovich (ABT 1807)
MFFMMMM Agap Nikiforovich (ABT 1807)
MFFMF ?
MFFMMM ?
MFFMMF ?
MMM [N1c1] Georgiy (Legor, Egor) Konstantinovich /Gavrilov/ (18 APR 1912-ABT 1942), Udmurtiya/Kirovskaya obl, Russia
MMMM [N1c1] Konstantin Ivanovich /Gavrilov/ (15 MAY 1874-ABT 1940)
MMMF Tatyana Iosifovna (Osipovna) /Gavrilov (?)/ (ABT 1876-ABT 1940)
MMMFM Iosef Andreevich, pochinok Sumsiil
MMMFF ?
MMF [U5] Domna Gavrilovna /Gavrilova (Ivshina)/ (31 AUG 1912-AFT 1942), Udmurtiya/Kirovskaya obl, Russia
MMFM Gavriil Nikolayevich /Ivshin/ (10 JUL 1887-ABT 1960)
MMFF [U5] Aleksandra Timofeevna /Ivshina (Kashina)/ (09 APR 1883-ABT 1960), Punsi-Uberi
MMFFM Timofey Egorovich /Kashin/ (~1856-09 MAR 1917)
MMFFF [U5] Marina Miheevna, Punsi-Uberi
MMMMM [N1c1] Ivan Gavrilovich /Gavrilov/ (ABT 1849-12 OCT 1896)
MMMMMM [N1c1] Gavriil? /Gavrilov/ 
MMMMMF ?
MMMMF Nataliya Ivanovna /Gavrilov (?)/ (ABT 1846-10 JAN 1906)
MMMMFM ?
MMMMFF ? 
MMFMM Nikolay Afanas'yevich /Ivshin/ (ABT 1858-18 JAN 1907)
MMFMF Marfa Ivanovna /Ivshina (?)/ (25 AUG 1854-?)
MMFMFM Ivan Vasil'evich, Vilingurd
MMFMFF Glikeriya Ivanovna
MMFMMM Afanasiy Timofeevich /Ivshin/ (?-BEF 1878), Punsi-Uberi
MMFMMF ? 
MMFMMMM Timofey
MMFMMMF ? 


The tree always shows one level of unknown terminal nodes, so if a next person will be researched - then his/her parents will be added with '?' marks which mean unknown. However you can avoid printing lines with question marks. A women may have two question marks under surname (delimited by // - as in GEDCOM format) and the one in brackets means her maiden surname. For example, if the marriage has been proved by documents and her marital surname is shown but her maiden surname is not - we have only one question mark - the one which is in the brackets as in "FirstName /Surname (?)/".

The format allows easy both-way conversions from Ahnentafel (also known as Sosa-Stradonitz) enumerating system. The problem of that format is that it uses surrogate keys (decimal numbering system counter) and it is not obvious - what is the exact path to a node - without arithmetic (geometric progression) calculations. When the depth is very big - the number may be even untraceable and may overflow computers registers. The path is much more scalable. The old system also does not specify - who is in the root: a female or a male (F or M), and it puts always 1 there. This eliminates the root node which makes the format not self-sufficient and is not convenient for automatic processing. Our format (atree) is self-sufficient (and subtrees can be safely cut from it). Also the format idea is to standardize format for genealogies exchange (spaces, delimiters, question marks for unknowns) between people as well as for automatic processing in internet.

The following Table shows correspondence between atree paths (IDs), Ahnentafel system numbers and binary numerical format for both Females and Males at the root. It is seen that when Female is on the root - then we may get binary format by direct substitution '1'<->'F', '0'<->'M'. But in case of Male - we should prevent automatic trimming of 0s at the beginning of numbers, so in case of 'M' - 1 is put as the very first digit (ignoring that 1 is for 'F').

personAhnentafel formatbinary formatatree (for women)atree (for men)
myself11FM
father210FMMM
mother311FFMF
grandfather (father's)4100FMMMMM
grandmother (father's)5101FMFMMF
grandfather (mother's)6110FFMMFM
grandmother (mother's)7111FFFMFF


Here is the correspondence between atree IDs and Ahnentafel system IDs for men for the first 4 levels (2^4=16 terminal nodes + 15 intermediate nodes):
1 M
2 MM
3 MF
4 MMM
5 MMF
6 MFM
7 MFF
8 MMMM
9 MMMF
10 MMFM
11 MMFF
12 MFMM
13 MFMF
14 MFFM
15 MFFF
16 MMMMM
17 MMMMF
18 MMMFM
19 MMMFF
20 MMFMM
21 MMFMF
22 MMFFM
23 MMFFF
24 MFMMM
25 MFMMF
26 MFMFM
27 MFMFF
28 MFFMM
29 MFFMF
30 MFFFM
31 MFFFF
...


Examples of a valid line, not showing the ID (path) prefix and skipping optional haplogroups:

FirstName
FirstName (1600)
FirstName (ABT 1600)
FirstName (14 AUG 1855-5 APR 1877)
FirstName /Surname/
FirstName /Surname/ (ABT 1800)
FirstName /Surname/ (?)
FirstName /Surname/ (?-ABT 1800)
FirstName /Surname/ (?-?)
FirstName /Surname/ (AFT 1800-?)
FirstName /Surname/ (1600-1680)
/Surname/ (28 OCT 1852-BEF 1880)
/Surname/ (?-AFT 1673)
/Surname (?)/ (28 OCT 1852-BEF 1880)
FirstName /Surname/ (1742-20 DEC 1799), Place of Origin
FirstName /Surname (MadenSurname)/ (1764-6 OCT 1813)
FirstName Another Name /Surname (MadenSurname)/ (14 AUG 1855-5 APR 1877), Place of origin
FirstName Another Name /Surname (MadenSurname)/ (14 AUG 1855-5 APR 1877) Place of origin
FirstName // (1600)
   <== nothing is known
// <== nothing is known
/?/ <== nothing is known but surname refining is the priority
FirstName //
FirstName /?/
FirstName /? (?)/
FirstName /Surname (?)/
FirstName /Surname(?)/ ?
FirstName /Surname/, ?
FirstName /Surname/ (?)
FirstName /Surname/ (?), ?
# this is a comment

# commas in the names or surnames before the names - as some languages require
# and as in GEDCOM
/Surname/, AnotherName 
/Surname/, AnotherName (1800)
/Surname/, AnotherName (1800), Place of origin

# if place of origin exists and not to confuse it with commas in names - use dates as the delimiter
Names (?), Place of origin
#or even
Names (?) Place of origin


Formal grammar:

============================
atree format grammar
============================

Grammar below is BNF-like, where terminals are delimited by " " (double-quotes), non-terminals -
by < >, and simple regular expresions like character grouping [ ], repetitions *, +, {N,M} are used.
[:blank:], [:graph:], [:alpha:], [:punct:], [:digit:] - are standard POSIX character classes.
( ) denotes grouping, ? ? - verbal explanation of a character class



<atree> ::= ( <atree_line> "\n" )*
<atree_line> ::= <comment_line> | <person_line> 
<comment_line> ::= "#" (<any_text_char>)*
<space_or_tab> :== [:blank:]
<person_line> ::= <path> <space_or_tab> (<optional_haplogroups> <space_or_tab>){0,1} <names>{0,1} <dates>{0,1} "," <location_other>{0,1}
<optional_haplogroups> ::= "[" ([:alpha:] | [:digit:]) "]"
<path> :== [FM]+
<location_other> ::= (<any_text_char>)*
<any_text_char> ::= [:alpha:] | [:punct:] | [:blank:]
<space_or_tab> ::= [:blank:]
<names> ::= <names_characters>* <surnames>{0,1} <names_characters>*
<surnames> ::= "/" <names_characters>* <maiden_names>{0,1} "/"
<names_characters> ::= ? any characters except '/', and no dates inside (valid date will be considered as names part delimiter) ?
<maiden_names> ::= "(" <names_characters>+ ")"
<dates> ::= "(" <birth_date_or_lifetime_range> ")"
<birth_date_or_lifetime_range> ::= <gedcom_year> | <full_date> | <lifetime_range> | "?"
<lifetime_range> ::= <gedcom_date> "-" <gedcom_date>
<gedcom_date> ::= <full_date> | <short_date> | <gedcom_year> | "?"
<full_date> ::= <date> [ ] <month> [ ] <year>
<short_date> ::= <month> [ ] <year>
<gedcom_year> ::= ("ABT " | "BEF " | "AFT ")* <year>
<year> ::= [:digit:]{4}
<month> ::= "JAN" | "FEB" | "MAR" | "APR" | "MAY" | "JUN" | "JUL" | "AUG" | "SEP" | "OCT" | "NOV" | "DEC"
<date> ::= [:digit:]{1,2}




Template for IDs (paths) for the first 31 nodes (4 generations):

M
MM
MF
MMM
MMF
MFM
MFF
MMMM
MMMF
MMFM
MMFF
MFMM
MFMF
MFFM
MFFF
MMMMM
MMMMF
MMMFM
MMMFF
MMFMM
MMFMF
MMFFM
MMFFF
MFMMM
MFMMF
MFMFM
MFMFF
MFFMM
MFFMF
MFFFM
MFFFF
...


A bigger template may be produced by copying of the terminal nodes twice (those with the longest path) and appending both M and F at the path end. Alternatively a Bourne shell script may be run under any Unix/Linux machine (specify depth, your gender and 'F' and 'M' in your native language - if desired):

#!/bin/sh
#user defined parameters:
depth=8
my_gender=M
female=F
male=M

nodes_number=$(echo "2^($depth+1)" | bc)
i=1
while [ $i -lt $nodes_number ] ;do
 number=$(echo "obase=2;$i" | bc)
 letters=$(printf "%0o\n" 0$number | tr '0' $male | tr '1' $female)

 #if male - change the first letter to $male:
 if [ "$my_gender" == "$male" ]; then
  letters=$(echo "$letters" | sed "s/^.\(.*\)/${my_gender}\1/")
 fi

 #printf "%d " $i 
 echo $letters
 i=`expr $i + 1`
done


If you have an existing genealogy which utilizes Ahnentafel enumeration system - it is possible to write a script which will convert Ahnentafel numbers into atree ID system. Also you could reuse the following script (bc calculator should be installed):

Usage of individual number convertor:

$ sh atree2s.sh 23567
FMFFFMMMMMMFFFF


The code:

#!/bin/sh

#user defined parameters:
my_gender=F
female=F
male=M

[ ! -n "$1" ] && {
 echo "Parameter not passed"
 exit -1;
}

if ! [[ "$1" =~ ^[0-9]+$ ]] ; then
 echo "Not a number"
 exit -1
fi

input=$1

number=$(echo "obase=2;$input" | bc)
letters=$(printf "%0o\n" 0$number | tr '0' $male | tr '1' $female)

#if male - change the first letter to $male:
if [ "$my_gender" == "$male" ]; then
 letters=$(echo "$letters" | sed "s/^.\(.*\)/${my_gender}\1/")
fi

#printf "%d " $i 
echo $letters


It may be desirable a reverse procedure: to convert atree ID to the corresponding Ahnentafel system decimal number:

$ sh s2atree.sh MFFMMMMFFFMMF
7225


The conversion script:

#!/bin/sh

#user defined parameters:
female=F
male=M

[ ! -n "$1" ] && {
 echo "Parameter not passed"
 exit -1;
}

input=$1

#substitute the first char - if male
length=$(expr length "$input" - 1)
first_char=$(expr substr "$input" 1 1)
the_rest=$(expr substr "$input" 2 "$length")
if [ "$first_char" == "$male" ]; then
 input=$(echo "$female$the_rest")
fi

binary=$(echo $input | tr $male '0' | tr $female '1')
decimal=$(echo "ibase=2;obase=A;$binary" | bc)
echo $decimal


Such scripts (or similar code) can be incorporated into another processing routine or convertor - between Ahnentafel and atree notations.

I use atree format for publishing my genealogy in genealogy forums, for sending it to relatives, discussing individual nodes, sending it to researchers and as a personal genealogy organizer for myself. Sure, additional information and metadata as well as photos and other information has been stored separately, in files, records or in a database, but this format allows to keep the main structure of the binary tree (been aligned with all the natural keys, IDs).
The format is also convenient for automatic processing by robots and who knows - may be in future robots will scan internet and will merge individual genealogies by applying heuristics, building genealogy trees for people automatically.

About atree in Russian

aisgedcom program (among other functions processes GEDCOM into atree)
  • Add to Memories

AES encryption of files (and strings) in java with randomization of IV (initialization vector)
[info]siberean
Java encryption-decryption examples, I've seen so far in Internet, are having IV been hard coded, i.e. not changed every time. However randomization of the initialization vector (IV) is a must for AES and for strong security (WEP was compromised because of hardcoding of IV). Notice that IV is not a "salt", and is not a secret, but like a cryptographic nonce - must be randomized each time.
In simple example below - IV is attached in the beginning of the stream.



import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.File;
import java.io.InputStream;
import java.io.OutputStream;
import java.security.SecureRandom;
import java.util.Arrays;

import javax.crypto.Cipher;
import javax.crypto.CipherInputStream;
import javax.crypto.CipherOutputStream;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;


public class Encryption {

	private static final int IV_LENGTH=16;

	/* A helper - to reuse the stream code below - if a small String is to be encrypted */
	public static byte[] encrypt(String plainText, String password) throws Exception {
		ByteArrayInputStream bis = new ByteArrayInputStream(plainText.getBytes("UTF8"));
		ByteArrayOutputStream bos = new ByteArrayOutputStream();
		encrypt(bis, bos, password);
		return bos.toByteArray();
	}


	public static byte[] decrypt(String cipherText, String password) throws Exception {
		byte[] cipherTextBytes = cipherText.getBytes();
		ByteArrayInputStream bis = new ByteArrayInputStream(cipherTextBytes);
		ByteArrayOutputStream bos = new ByteArrayOutputStream();
		decrypt(bis, bos, password);		
		return bos.toByteArray();
	}


	public static void encrypt(InputStream in, OutputStream out, String password) throws Exception{

		SecureRandom r = new SecureRandom();
		byte[] iv = new byte[IV_LENGTH];
		r.nextBytes(iv);
		out.write(iv); //write IV as a prefix
		out.flush();
		//System.out.println(">>>>>>>>written"+Arrays.toString(iv));

		Cipher cipher = Cipher.getInstance("AES/CFB8/NoPadding"); //"DES/ECB/PKCS5Padding";"AES/CBC/PKCS5Padding"
		SecretKeySpec keySpec = new SecretKeySpec(password.getBytes(), "AES");
		IvParameterSpec ivSpec = new IvParameterSpec(iv);
		cipher.init(Cipher.ENCRYPT_MODE, keySpec, ivSpec);    	

		out = new CipherOutputStream(out, cipher);
		byte[] buf = new byte[1024];
		int numRead = 0;
		while ((numRead = in.read(buf)) >= 0) {
			out.write(buf, 0, numRead);
		}
		out.close();
	}


	public static void decrypt(InputStream in, OutputStream out, String password) throws Exception{

		byte[] iv = new byte[IV_LENGTH];
		in.read(iv);
		//System.out.println(">>>>>>>>red"+Arrays.toString(iv));

		Cipher cipher = Cipher.getInstance("AES/CFB8/NoPadding"); //"DES/ECB/PKCS5Padding";"AES/CBC/PKCS5Padding"
		SecretKeySpec keySpec = new SecretKeySpec(password.getBytes(), "AES");
		IvParameterSpec ivSpec = new IvParameterSpec(iv);
		cipher.init(Cipher.DECRYPT_MODE, keySpec, ivSpec);

		in = new CipherInputStream(in, cipher);
		byte[] buf = new byte[1024];
		int numRead = 0;
		while ((numRead = in.read(buf)) >= 0) {
			out.write(buf, 0, numRead);
		}
		out.close();
	}


	public static void copy(int mode, String inputFile, String outputFile, String password) throws Exception {

		BufferedInputStream is = new BufferedInputStream(new FileInputStream(inputFile));
		BufferedOutputStream os = new BufferedOutputStream(new FileOutputStream(outputFile));
		if(mode==Cipher.ENCRYPT_MODE){
			encrypt(is, os, password);
		}
		else if(mode==Cipher.DECRYPT_MODE){
			decrypt(is, os, password);
		}
		else throw new Exception("unknown mode");
		is.close();
		os.close();
	}


	public static void main(String[] args){

		if(args.length<1){
			System.out.println("Pass at least one argument (filename)");		
			return;
		}
		try{
			//check files - just for safety
			String fileName=args[0];
			String tempFileName=fileName+".enc";
			String resultFileName=fileName+".dec";

			File file = new File(fileName);
			if(!file.exists()){
				System.out.println("No file "+fileName);
				return;
			}
			File file2 = new File(tempFileName);
			File file3 = new File(resultFileName);
			if(file2.exists() || file3.exists()){
				System.out.println("File for encrypted temp file or for the result decrypted file already exists. Please remove it or use a different file name");
				return;
			}

			copy(Cipher.ENCRYPT_MODE, fileName, tempFileName, "password12345678");
			copy(Cipher.DECRYPT_MODE, tempFileName, resultFileName, "password12345678");

			System.out.println("Success. Find encrypted and decripted files in current directory");
		}
		catch(Exception e){
			e.printStackTrace();
		}
	}	

}




Usage:


$ javac Encryption.java



Pass any existing file, you want to encrypt through command line argument (test.sh in the following example):

$ java Encryption test.sh
Success. Find encrypted and decripted files in current directory


Encrypted file (test.enc):
$ cat test.sh.enc
&X▒b▒▒▒▒_▒▒$Z▒▒f▒XboM▒  ▒_f§R▒s▒♣▒▒K▒M;▒▒▒▒'L▒ZS◄;▒▒i
▒▒|VØ▒:?▒▒▒?▒9y{7"▒▒▒▒+▒▒e}▒▒yi▒▒y_/jjU:▒▒_▒ ►p▒?▒▒▒;\[lE▒▒▒▒Cpc▒46▒▒▒▒@▒<▒n▒↓I▒
▒▒s▒?b▒p▒O▒▒▒▒▒▒▒\d▒4n3'▒▒▒Y♦<▒▒▒▒▒▒>▒▒▒▒Ih▒▒▒´\▒↓_R▒vGW▒▒▒V▒▒?(Q♥G     J◄DMS▒▒▒
zC;*


Let's check the decrypted file (test.dec):

$ cat test.sh.dec
#!/bin/sh

i=0
depth=6

nodes_number=$(echo "2^$depth" | bc)

#echo "total nodes: $nodes_number"

while [ $i -lt $nodes_number ] ;do

        number=$(echo "obase=2;$i" | bc)
        printf "%0${depth}o\n" 0$number
        i=`expr $i + 1`
done


The file is readable.
  • Add to Memories

Cleaning nearby park
[info]siberean
We love hiking, but hate all the bottles, cans, and garbage that we see lying around, ruining the beauty of nature. So this past weekend, we decided to make use of the beautiful weather, and took a hike in a nearby park, equipped with garbage bags







Here is what was collected:

On the way back:



And finally the garbage bags were left on the curb:






Unfortunately, we couldn't pick up big pieces of dumped garbage. Hopefully, municipality could give us a hand.
Examples of big pieces:








Look into my mini-album for more pictures (now - without garbage) - from the same hike:

Autumn in nearby park. Last weekend of October.
  • Add to Memories

Brain process as 3 primitive mathematical operations
[info]siberean
VGavrilov
August 25, 2011

/Draft/

Abstract:
High-level cognitive functions can be decomposed into 3 primitive math operations: associative memory storing, partitioning and disjoint (in sets terminology), or surjection and injection and so it is likely that during the evolution the abilities to do those primitive functions by a brain have been evolved in different times: disjoint first (visual cortex), then storing (by means of hippocampus), and partitioning - the last. If future fMRI research will show localization of the partitioning function in a separate from the visual cortex region, for example in neocortex - it will be in favour of the model.


Associative memory (as opposed to address or location-based memory) is the very basic memory while associative learning is the very basic type of brain operations in the human brain (and not only human). The area of associative processing is the same in monkeys and in humans *) and it is done in lower, 'sensory' areas rather than in the ITC (inferior temporal cortex), how it was thought in the past *). ITC is currently considered to be responsible for more complex processing such as visual processing and complex object features and shapes.
As for the associative memory apparatus - one of the most critical regions is hippocampus, one of the so-called "middle brain" regions, i.e. pretty 'low-level' layer of the brain comparing to the neocortex, which had been developed significantly only in our closest ancestors. Hippocampus is not only critical for memory formation but it is also responsible for spatial abilities and navigation in mammals, and evolutionary it is an ancient organ.
So associative memory by itself (as a storage and basic routine) is rather not unique to human beings and appeared long time ago during the evolution for processing of basic routines and is not different from mammals or at least monkeys (which is shown by fMRI).
However, the processes - how stored associations are used subsequently in the information processing pipeline - are not well understood by neurobiologists yet.
Understanding the complex parallel processing which may involve multiple brain areas may take time, and for achieving that - some models of cognitive functions should be built.

We take a different, interdisciplinary approach to come to the models. Let's take math trying to find the most basic processing primitives, necessary for the cognitive function. By optimality of nature and evolution and Occam's razor - the real mechanisms should not be much more complex than that (a sequence of primitive functions). At least - the real mechanisms should be able to be gained by an adding of elementary useful functions as the result of collected useful mutations, not simultaneous appearance of multiple necessary functions at once, because despite of usefulness of such complex function - the probability of simultaneous appearance of multiple features would be small.
The model should be close to the real brain not only functionally but ideally it should also reflect partitioning of the real functional parts, which appeared evolutionary in different times, into separate units, as they appeared in the brain (example: memory, vision). That will allow not only modelling brain functionally but also being able to find a similar mechanism in the brain when it is not found yet, but been demanded by the model. So, when a feature, which is following from the model, will be later found in the brain - this may give a clue about adequacy of the model or vice-versa.

What is necessary for any complex processing? Work of the brain is a function. Mathematically it may be represented as a function of inputs from the environment and the internal states (brain's memory). The brain creates a "virtual world" in itself (in consciousness loop) which models the real world, based on experience (memory, been collected during the lifetime and the state of the brain been inherited). That experience is been collected as associations (the very basic and simple form of memory), which are formed in hippocampus and later stored in long-term memory. So, the associative memory is the 1st necessary function (storing function). However, hippocampus is more high-level organ than visual cortex, and it is visual cortex which makes that most basic operations (processing) function - even before memory capacity is necessary. Let's move to the math and will come to this from a different side.





From math we know that any function can be decomposed into a surjection and an injection (we can decompose to a different combination, but the motivation of taking surjection will be clear soon). Surjective function - is a function on sets if every element in domain can be obtained by the function from some element in the range.
Injection function is one that preserves distinctness. In the terms of sets - surjection and injection are partitionining and disjoint, correspondingly.
So, any complex function collapses ("divide and coquer" approach) into two simpler sequential operations (and we know that the brain and evolution "likes" sequential operations: it is easier to do 2 simpler operations one after another rather than to "invent" a complex operation).
Why we took surjection and injection?
Let's consider the first one. Partitioning (or surjection) in the material (not mathematical) world of the brain is a process of taking all objects (from a domain) and finding an attribute or a property (i.e. a function) - to divide it into non-intersecting categories. Those categories from now on can be distinguished only by that single attribute (set of attributes always can be concatenated into one) but not by different "intersecting" attributes (not to compare apples and oranges). So, this basic operation, surjection, is "simplifying" the world, without which the complex world could not be understood. In other words - it is "encoding" of the world into a model (in case of a humans - into linguistic model). Simple example: all objects are divided into black, white and colored, three not intersecting sets. There are no objects which exist in co-domain but not in the image (range), i.e. each object is attributed by one of those 3 "encodings": "black", "white" or "colored". It will require a very short message - to transfer or store the information in memory using such efficient encoding. It may be seen that it is partitioning, i.e. a surjection.
We could do the following formal decomposing of the surjection into projection and bijection, but the latest in the "brain world" may be understood as a simple re-mapping or changing of the language, or attribute names, so we can ignore that treating surjection and projection as synonyms (in our "brain world"). So, let's leave surjection (partitioning) as a basic operation of the brain.
We should notice, that dichotomy, i.e. partitioning of a set into 2 disjoint sets - is a special case of more general partitioning, a "binary" version of it.
Starting from the most basic, association operation, multiple associations may lead to one "right-side" of the association, ending in the same category, becoming indistinguishable. With time - "neurons that fire together - wire together" and we have a compressed grammar (of the "left-sides") to manipulate the "right-sides". We should notice here that here we speak about the model of the world (in a brain) as a non-contradictory, deterministic function, ignoring changing of the brain state in time, random fluctuations (which are not detected) and voluntary agreed contradictions (politics) and considering only a consistent non-contradictive "virtual world". At least in a short time of brain activity this is a reasonable assumption.
Phenomenon of organizing associations into sets may be also named as cluster analysis. If something does not have an association - it must be included into some set, so the whole range is consisting of non-intersecting regions (dichotomies: "day" and "night", "bad" and "good", as well as non-dichotomies: "black", "white" and "color", 4 seasons, 12 months etc). However, "compression" must not be only by means of a human language, and in case of animals it may be a reflex.

The second necessary addendum to surjection (to be able to simulate any function) is injection. Injection is a one-to-one function which preserves distinctness and never maps two elements in the domain to one element in the co domain (which surjection do). It does not require image to "fill" the co-domain, so there may be elements in co-domain which are not in the image (which do not have associations yet), however if 2 "left-sides" are distinct - so are the "right-sides" (of the association or a function). This is a function of distinguishing 2 different objects (taken arbitrarily), finding their difference and putting into disjoint sets which are not intersecting. It is intuitively obvious that it is also a "generator" of new features or properties of objects, of new terms (necessarily "invented" as a consequence of the distinctness of the two objects), the source of increasing the vocabulary and new connections in the brain. This also does not require assigning human language labels, or verbal attributes to each category and role of "labels" or "language" may be played by reflexes.

Mathematicians know that any complex function may be decomposed into superposition of surjection and injection. So, it is possible that any arbitrarily complex function (and a cognitive function is not an exception) may be also decomposed into 3 sequential operations: storing associative memories, partitioning and disjoint (on sets of objects been referenced by the associative memory). If fMRI scanning in the future will confirm that partitioning takes place in a separate field from disjoint area (which is processed by the early stages of visual cortex *) - it might be a clue that the above mentioned model is adequate and those primitive cognitive functions were developed by evolution in separate known stages: first, after vision emerged - appeared an ability for disjoint function, then - hippocampus allowed to store more associative memories, then the partitioning was added to the set. On the latest stages - all those 3 functions allowed to model and process any arbitrarily complex function with subsequent encoding it using more advanced (linguistic) encoding but still using the same basic elementary mechanisms (including associations). With the most advanced cognitive abilities - we, humans, still use those primitive operations, inherited on different stages of evolution: disjoint (injection), partitioning (surjection) and association (for both encoding and storage).

Added (Dec 1, 2011):
One of candidates for the surjection region in the brain is right fusiform gyrus which is responsible for object agnosia, i.e. inability to distinguish and recognize abstract objects.

*)
http://www.sciencedirect.com/science/article/pii/S0896627311004818
http://www.sciencedaily.com/releases/2007/03/070314134812.htm
http://www.apa.org/science/about/psa/2005/02/suzuki.aspx
http://www.ncbi.nlm.nih.gov/pubmed/16343950
http://jn.physiology.org/content/93/1/603.full
http://www.cell.com/neuron/abstract/S0896-6273%2811%2900481-8
  • Add to Memories

Bourne shell scripting
[info]siberean

Join of two files (without sorting)


Problem: make a join (as in databases) of couple of files by 1st field in the first file and 4th field in the second file, without changing (sorting) of the second file.

$ cat 1.txt
1 11 111
2 22 222
7 77 777

$ cat 2.txt

a       rs3094315       0       3
b       rs2887286       0       1
d       rs6685064       0       2
c       rs3094315       0       7


Files have different delimiters (1st file uses spaces while 2nd file uses tabs).
Also files have headers (not shown for clarity). The first file is a lookup table and can be unsorted as well (but all lines are unique by the first field).

Result (after join) should look like:

a       rs3094315       0       3
b       rs2887286       0       1       11      111
d       rs6685064       0       2       22      222
c       rs3094315       0       7       77      777



If sorting was allowed - we would use built-in Unix "join" command. However the 2nd file cannot be rearranged and should preserve the order. We'll use shell script (old and portable good friend, Bourne shell).

#!/bin/sh
while read a b c d 
do 
	e=`grep "^$d " 1.txt | sed "s/^$d \(.*\)$/\1/;s/ /\t/g"`
       	echo -e "$a\t$b\t$c\t$d\t$e"
done<2.txt

The first line assigns to variable e the result returned by "grep", piped to "sed". "grep" extracts only the lines which correspond to 4th field (we have red it into variable d) and since we need a substitution (not the line as is) - we pipe the line to "sed". The latest has 2 expressions: 2 substitutions, delimited by column. The first substitution cuts off the 1st field (including the space), leaving the rest, while the second expression substitutes spaces by tabs (since we need tabs on output). We could pipe to "tr ' ' '\t'" instead by reusing of existing "sed" process is more efficient.


The same as one-liner:

$ while read a b c d; do e=`grep "^$d " 1.txt | sed "s/^$d \(.*\)$/\1/;s/ /\t/g"`; echo
-e "$a\t$b\t$c\t$d\t$e"; done<2.txt

a       rs3094315       0       3
b       rs2887286       0       1       11      111
d       rs6685064       0       2       22      222
c       rs3094315       0       7       77      777


The result is what is as expected.
It is possible to optimize the expressions (adding -m 1 to grep - to skip the 1st file after the proper line finding etc) but we leave this for the reader.

We used grep and sed as a piped subprocesses, so the total processing is kinda slow on big files. awk will be much faster:

$ awk 'BEGIN {FS=" ";OFS="\t"};NR==FNR{a[$1]=$2 OFS $3;next};{print $0,a[$4]}'
1.txt 2.txt

a       rs3094315       0       3
b       rs2887286       0       1       11      111
d       rs6685064       0       2       22      222
c       rs3094315       0       7       77      777


Other options (to avoid subprocesses in Bourne shell and repeated reads of the first file) are to use bash ver.4 or perl or any scripting language (where associative arrays exist) or even a general-purpose language, especially if very big files will be processed frequently. But for most of the tasks - the approach above may work.


Generating binary numbers of specified number of digits



#!/bin/sh

i=0
depth=10

nodes_number=$(echo "2^$depth" | bc)

echo "total nodes: $nodes_number"

while [ $i -lt $nodes_number ] ;do

	i=`expr $i + 1`
	number=$(echo "obase=2;$i" | bc)
	printf "%0${depth}o\n" 0$number
done



Complex example with POSIX getopts, help etc. (also shows - how to structure big scripts)


Memo saver (implementation of scalable storage for indexed text chunks and bookmarks.
Source also can be downloaded from http://sourceforge.net/projects/ais/files/old/ais-scripts/ais-scripts-0.5/ais-scripts-0.5.tar.gz/download




file common
------------------------------8<-------------------------------------

#!/bin/sh
############################################################
#
# Common file sourced to all the scripts manipulating file
# based database
#
# $Id: common,v 1.6 2005/01/28 00:27:25 sourcer Exp $
############################################################

DEFAULT_INDEX="INDEX"


#where to find INDEX. Set default if not passed in INDEX env var
[ ! -n "$INDEX" ] && {
    #echo "WARN: $INDEX env var is not set." 
	#echo "You can set it through: export INDEX=/dir/of/index"
	#echo "The default will be used"
	#echo

#if we are putting this set of scripts into java/c versions - we
#should use INDEX database which is above one level
	if [ -f "../server" ] ;then
		INDEX="../ISROOT/INDEX"
	else
		INDEX="$DEFAULT_INDEX"
	fi
}


[ ! -n "$MAX_FILES_IN_DIR" ] && {
	MAX_FILES_IN_DIR=100 #max files in a dir. Needed for non XFS/jfs fs
	#actual number will be bigger on alphabet (MAX_FILES_IN_DIR+alphabet_size)
	#because one-letter words for 1 level, 2-letter words for 2nd level should be placed somewhere anyway ;)
}

DEFAULT_LEVEL=0
[ ! -n $DEBUG ] && DEBUG=$DEFAULT_LEVEL

COMMON_DEFINED=yes
#exit codes
E_WRONGARGS=2
E_INTERNAL=3

DEBUG=0

#short usage examples list
usage(){
	cat << EOF
Usage: 

   get 
                   (get resources indexed by )
   get -k 
                   (the same as before)
   get -e ""
                   (the whole sentence used as a )
   get   
                   (key1 & key2 & key3 must be in the text)
   put -k  -v 
                   (put value indexing it by )    
   put -k key value
                   (the same as before)
   put -k    -v 
   
   put -e -k "" -v 

EOF
}

help(){
	cat << EOF
	
Options:
    -h         Print this help screen
    -d         Debug this script
    -i   Redefine INDEX location
    -k    Pass the key (with 'get' or 'put')
    -v  Pass the value (with 'put')
    -e         Treat passed keys as a sentence (one key vs set of keys) *
    -m         Redefine max files in directories (for non-scalable fs)
            * not implemented yet	
EOF
}

IS_SENTENCE=

while getopts i:k:v:m:hde o
do	case "$o" in
	i)	INDEX="$OPTARG"
		;;
	k)	KEY="$OPTARG"
		;;
	v)	VALUE="$OPTARG"
		;;
	m)	MAX_FILES_IN_DIR="$OPTARG"
		;;
	h)	help
		exit 0
		;;
	d)	DEBUG=1
		;;
	e)	IS_SENTENCE=1
		;;
	[?]) #usage
		exit $E_WRONGARGS
		;;
	esac
done

#decrement MAX_FILES_IN_DIR since we have a special file + the_same_dir_word which are not included into the count
MAX_FILES_IN_DIR=`expr $MAX_FILES_IN_DIR - 2`

case "$DEBUG" in 7 ) set -x ;; esac

#verify whether INDEX (database of indices) exists
[ ! -d "$INDEX" ] && {
	echo "Directory $INDEX does not exist! Either create it or export in INDEX env variable the path to the index. Alternatively option -i can be used for passing the path"
	exit $E_WRONGARGS
}

shift `expr $OPTIND - 1`

#FindPath $KEY where KEY is encoded ('_' as spaces) key
#This function is used in both 'get' and 'put' so it is here
#KEYPATH is global var used for value returning and collecting the result
#it is depth-first search.
findPathForKey(){
	local keyLength=`expr length "$1"`
	KEYPATH="$INDEX" #begin from the root
	depth=1
  	while [ true ] ;do

        #even for get we need writable otherwise we'll have inconsistency 
		#[ -w "$KEYPATH/$1" ] && { #if file exists and writable - no need to go deeper
		#	return 
		#}

		[ $keyLength -le $depth ] && { #key length not shorter than depth
			return 
		}
	    
		local prefix=`expr substr "$1" $depth 1 | tr 'a-z' 'A-Z'`
		local newPath="$KEYPATH/$prefix"
		if [ -d "$newPath" ] ;then
			KEYPATH="$newPath"
			depth=`expr $depth + 1`
		else
			return
		fi
 	done
}

------------------------------8<-------------------------------------

file delete
------------------------------8<-------------------------------------

#!/bin/sh
############################################################
#
# Script for removing lines from file-based database for 
# particular resource. 
#
# Usage: ./delete pattern
# (pattern should be escaped in regular way)
# 
# Example (if external ISROOT is used, under cygwin): 
# ./delete -i /mount/archives/ISROOT 'mydoc20041224'
#
# $Id: delete,v 1.1.1.1 2005/01/20 15:19:38 sourcer Exp $
############################################################

source common

#all the rest is the pattern to find
REGEXP="$@"

[ -z "$REGEXP" ] && {
	echo -n "REGEXP for lines to be removed not set, please enter it now "
	read REGEXP
}

echo "Removing records containing \"$REGEXP\" from database at $INDEX"
echo
# print filenames only not descending into RCS, CVS and .svn dirs
find "$INDEX" -name RCS -prune -o -name CVS -prune -o -name ".svn" -prune -o -print0 | xargs grep -r0l "${REGEXP}" |
while read file
do
	line=`sed -ne "/${REGEXP}/p" "${file}"`
	sed -e "/${REGEXP}/d" "${file}" > "${file}.tmp"
	mv "${file}.tmp" "${file}"
	echo
	echo "From ${file}:"
	echo "$line"
done
------------------------------8<-------------------------------------

file get
------------------------------8<-------------------------------------
#!/bin/sh

source common

if [ -n "$KEY" ] ;then
	if [ -n "$*" ] ;then
		#we have keys in both -k and in arguments as in get -k k1 k2 k3
		#so we need to concatenate them
		set -- "$KEY" "$@"
	else #no keys in arguments
		set -- "$KEY" 
	fi
else
    [ ! -n "$*" ] && {  
		#neither passed in -k nor as arguments
		echo "No keys passed?"
		exit $E_WRONGARGS
	}
fi

[ -n "$VALUE" ] && { 
	echo "VALUE has been passed to get??? Use 'put' - for saving"
	exit $E_WRONGARGS
}

for aKey in "$@" ;do
    #encode first:
	aKey=`echo "$aKey" | tr 'A-Z' 'a-z' | tr ' ' '_' | tr '\t' '_'` 
	findPathForKey $aKey
	cat "$KEYPATH/$aKey" 2>/dev/null | sed -e '/^[ \t]*$/d'
	#[ $? ] && {
	#	echo "not found"
	#}
	echo
done

------------------------------8<-------------------------------------

file put
------------------------------8<-------------------------------------
#!/bin/sh

source common
source pump.sh

[ ! -n "$KEY" ] && {
	echo "no keys passed???"
	exit $E_WRONGARGS
}

if [ -n "$VALUE" ] ;then
	if [ -n "$*" ] ;then
		#we have values in both -v and in arguments as in put -v v1 v2 v3
		#so we need to concatenate them
		set -- "$VALUE" "$@"
	else #no keys in arguments
		set -- "$VALUE" 
	fi
else
    [ ! -n "$*" ] && {  
		#neither passed in -v nor as arguments
		echo "No values passed?"
		exit $E_WRONGARGS
	}
fi

#encoding the key
KEY=`echo "$KEY" | tr 'A-Z' 'a-z' | tr ' ' '_' | tr '\t' '_'` 

findPathForKey "$KEY"

if [ ! -f "$KEYPATH/$KEY" ] ;then #otherwise - no problem: just append

    #here we may have 3 situations:

	#1. new terminal node (leaf) - just create new file on this level
	if [ `expr length "$KEY"` -le $depth ] ;then
		:
	else

    	#2. already expanded diectory (only this file dir is not existing
		if [ -f "$KEYPATH/AIS" ] ;then
			#for future clarity depth probably should be calculated here rather
			#than reusing global var relying on late findPathForKey() call 
			
		    currentLetter=`expr substr "$KEY" $depth 1 | tr 'a-z' 'A-Z'`
			KEYPATH="$KEYPATH/$currentLetter"
			[ ! -d "$KEYPATH" ] && mkdir $KEYPATH
			
		#not expanded: number of files was below threshold
		else
		    #check whether it is below
		    filesInDir=`ls "$KEYPATH" | wc -l` 
			#we could cache filesInDir in AIS and increment - if it is worth
			if [ $filesInDir -ge $MAX_FILES_IN_DIR ] ;then
		    	expandIfNeeded #notice that $KEYPATH will be updated as needed

				#the same as before
		    	currentLetter=`expr substr "$KEY" $depth 1 | tr 'a-z' 'A-Z'`
				KEYPATH="$KEYPATH/$currentLetter"
				[ ! -d "$KEYPATH" ] && mkdir $KEYPATH
			fi		
		fi
	fi
fi

echo "$@" >> "$KEYPATH"/"$KEY"
[ $DEBUG -gt 0 ] && echo "key $KEY saved in $KEYPATH"

------------------------------8<-------------------------------------

file pump
------------------------------8<-------------------------------------
#!/bin/sh
############################################################
#
# sourced by 'put'
#
# $Id: pump.sh,v 1.5 2005/01/28 00:23:51 sourcer Exp $
############################################################

[ -z $COMMON_DEFINED ] && source common

#the same as uniq but working with any - not adjacent only lines.
#we are not supposing to have lines far (ls is already sorted) and we have
#swap anyway ;)
uniq2(){
	awk '{
		found = 0
		for ( i=0; i "$KEYPATH/AIS"
	
    #find alphabet for files which are not in dirs:
	local indexOfLetter=`expr $depth - 1`
	
	#TODO please optimize me!
    local alphabet=`find "$KEYPATH" -maxdepth 1 -type f -print | 
		sed -e 's/.*\///' |
		sed -e "/^.\{0,$depth\}$/d" |
		sed -e "s/^.\{$indexOfLetter\}\(.\)\{1\}.*/\1/" | 
		tr 'A-Z' 'a-z' | 
		uniq2` 
	
[ $DEBUG -gt 0 ] && echo "Moving files begining from: $alphabet from $KEYPATH"

	for char in $alphabet ;do
    	#(notice that we do not need to trim RCS, CVS, .svn dirs since
		#we are dealing with files only, non-recursively)

		local CHAR=`echo $char | tr 'a-z' 'A-Z'`
		local newDir="$KEYPATH/$CHAR"
		
		[ ! -d "$newDir" ] && {
			mkdir "$newDir"
		}
		find "$KEYPATH" -maxdepth 1 -type f \! -name AIS | grep -e ".*/[^/]\{$indexOfLetter\}[$char$CHAR][^/]*$" | xargs -i mv "{}" "$newDir/"  
		#somehow find+grep+xargs works but 'find -regex -exec' - was not
		
				
	done
	
}
------------------------------8<-------------------------------------

file unit_tests.sh (this is bash, not Bourne shell, sorry)
------------------------------8<-------------------------------------
#!/bin/bash
#sorry - this is bash script: please fix it to be bourne - if necessary

export INDEX=tmp

#redefine files number threshold - to use smaller number of files
export MAX_FILES_IN_DIR=3

source common

echo "Regression test starting..."

#use new temp directory:
echo "Creating tmp directory"
mkdir tmp


echo "Creating approximately 3500 keys"
i=10
while [ $i -gt 0 ] ;do
	for j in `cat testwords` ;do
		./put -k "$j" -v "/mnt/archive/20041223 #$i"
	done
	i=`expr $i - 1`
	echo $i
done
echo
echo "Creating 62000 keys"
for i in `cat dict` ;do
	./put -k "$i" -v "/this/is/test #$i"
done
echo $i
echo

#find stored before key:
#get  

------------------------------8<-------------------------------------

file files2db.sh (this is bash, not Bourne shell, sorry)
------------------------------8<-------------------------------------
#!/bin/sh
############################################################
#
# Script for copying data from file-based database to db. 
#
# For usage run: files2db.sh -h
# 
# Example (if external ISROOT is used, under cygwin): 
# sh files2db.sh -i d:/ISROOT
#
# $Id: files2db.sh,v 1.1.1.1 2005/01/20 15:19:38 sourcer Exp $
############################################################

source common

#populates database taking key and value from $1 and the rest of the line - accordingly
addRecord(){
        key=$1
        shift
        
        #remove trailing and leading spaces, tabs, white and comment lines:
	trimmedLine=`echo "$@" | awk '{gsub(/^[ \t]+|[ \t]+$/,"");print}' | grep '.' | sed -e '/^#.*$/d'`
	
	if [ -n "$trimmedLine" ]
	then
		echo "value=$trimmedLine"
		./is -k "$key" -v "$trimmedLine"
	else
		echo "blank line - skipped"
	fi
}

find ${INDEX} -type f -print | grep -v CVS | grep -v .svn |
while read file
do
	filename="`basename ${file}`"
	
	echo "Processing ${file}..."
	
	echo "key=${filename}"	
	while read line
	do
		addRecord "$filename" "$line"
	
	done <"${file}"
	
	#pipe problem reading the last line (not called in the above-mentioned loop):
	addRecord "$filename" "$line"
	
	echo ""
done
------------------------------8<-------------------------------------



Other utility scripts


Scripts below are not beautiful, not efficient but are doing their tasks.

Renaming of a bunch of files using some hardcoded into the script rules



#!/bin/sh

find . -name "*.xls" -print |
while read file
do
	newName=`echo "${file}" |
	sed -e "s/Report /R/;
		s/ Records Submitted by Submission Status//;
		s/ Errors Encountered on Submission Summary//;
	 	s/ Timeliness of Data Submissions//;
		s/ Records in Database by Record Type//;
		s/R3 /R3/;
		s/ Records in Database by Data Holding Group//;
		s/-ON-2006//;
		s/SURGICAL/SURGERY/;
		s/R1-OMHRS.xls/R1-OMHRS-OTHER.xls/;
		s/R1-NRS.xls/R1-NRS-OTHER.xls/"`

if [ "${file}" != "${newName}" ]; then	

	echo "Moving"
	echo "${file} -->"
	echo "${newName}"
	echo
	mv "${file}" "${newName}" 	
fi

done


An example of copying/renaming using regular expressions.


Copying of a bunch of files into automatically created hierarchy with a predefined directory structure (submissions periods).

#!/bin/sh
#
# Script copying files from one period to many other periods 
# (few years, few quarters) for testing
#
# $Id$
#

year2copyfrom=6
year2copyfromplus=`expr $year2copyfrom + 1`
quarter2copyfrom=q2
dirfromcopy=/cygdrive/m/unit/pa64/data_uploaded/200${year2copyfrom}-200${year2copyfromplus}/${quarter2copyfrom}
dir2copy2=/cygdrive/c/dqsc/copieddata

find "$dirfromcopy" -maxdepth 1 -name "*.txt" -printf "%f\n" |
while read file
do
	echo "processing $file"

	for year2copy2 in 3 4 5
	do

		year2copy2plus=`expr $year2copy2 + 1`

		for quarter2copy2 in q1 q2 q3 q4
       		do

			newfilename=`echo "$file" | sed -e "s/0${year2copyfrom}${quarter2copyfrom}/0${year2copy2}${quarter2copy2}/"`
			echo "	>> ${newfilename}"
			cat ${dirfromcopy}/${file} 	| sed -e "1s%200${year2copyfrom}/200${year2copyfromplus}%200${year2copy2}/200${year2copy2plus}%" 	
					| sed -e "1s%${quarter2copyfrom}%${quarter2copy2}%" 
					| sed -e "/[0-9]\{2\}-\w\w\w-[0-9]\{4\}/s/200${year2copyfrom}/200${year2copy2}/g" 
					| sed -e "/[0-9]\{2\}-\w\w\w-[0-9]\{4\}/s/200${year2copyfromplus}/200${year2copy2plus}/g" 
					| unix2dos > ${dir2copy2}/${newfilename}


		done
	done
done
  • Add to Memories

Shell scripts for DNA processing
[info]siberean
Here I'll put specific scripts, not included in http://siberean.livejournal.com/13367.html

Script processing yDNA markers to binary form



input.txt:
13 25 16 11 11-14 12 12 10 14 11 31 15 9-10 11 11 25 14 20 32 12-15-15-16 11 11 19-23 17 15 17 20 36-40 12 11 11 8 17-18 8 12 10 8 10 10 12 21-22 15 10 12 12 12 8 14 23 21 12 12 11 13 11 11 12 13

('-' and tabs will be replaced by spaces - before processing)

#!/bin/sh

inputfile="input.txt"

addRecord(){
	set -- $@

	for col in "$@"
	do
		#left here for test:
		#echo -n "$col"
		#echo -n " "
		binary=`echo "obase=2;ibase=10; $col" | bc`
		#6 leading characters padded with zeros
		printf "%06o" 0"$binary"
		#echo
	done
}

while read line
do
	#convert tabs and - to spaces (to have only spaces as delimiters)
	line=`echo "$line" | tr '\t' ' ' | tr '-' ' '`
        addRecord "$line"
done<"$inputfile"


$ sh num2bin.sh input.txt
00110101100101000000101100101100111000110000110000101000111000101101111100111100
10010010100010110010110110010011100101001000000011000011110011110100000010110010
11010011010111010001001111010001010100100100101000001100001011001011001000010001
01001000100000110000101000100000101000101000110001010101011000111100101000110000
11000011000010000011100101110101010011000011000010110011010010110010110011000011
01
  • Add to Memories

Script filling the matrix to the desired fixed number of columns
[info]siberean
I will put it here.

(fields assumed to be delimited by tab and spaces are allowed in the fields. No fields will be trimmed on output, so if the number of fields in any of the lines is bigger than chosen maximum - they will be just copied from the input file, so choose the maximum appropriately):

$ cat test
1       2       3       4
1       2       3       4       5       6
11      2 2     3


$ ./test.sh
1|2|3|4|0|0|
1|2|3|4|5|6|
11|2 2|3|0|0|0|
0|0|0|0|0|0|
0|0|0|0|0|0|

$ cat test.sh

#!/bin/sh
file="test"
NUM_COLS_REQUIRED=6

addRecord(){

        #set tab as delimiter
        IFS="   "

        #cols counter
        colNum=1

        #distribute arguments passed to the function in positional parameters
        set -- $@

        for col in "$@"  # Doesn't work properly if "$*" isn't quoted.
        do
                echo -n "$col"
                echo -n "|" #new delimiter
                let "colNum+=1"
        done

        while [ $colNum -le $NUM_COLS_REQUIRED ]
        do
                echo -n "0"
                echo -n "|" #new delimiter
                let "colNum+=1"
        done

        echo "" #end of line
}


while read line
do
        addRecord "$line"

done<"${file}"
#add the last line (not called in the above-mentioned loop!)
addRecord "$line"


####################### END ####################################


Usage (update the script to the desired output delimiters, number of fields and input file name):
./test.sh > output

You will want to change output delimiter ('|' has been chosen jsut for demo), NUM_COLUMNS_REQUIRED, and filename. And do not forget to put actual tab (ASCII 7) under IFS="" (in this page it was substituted by spaces).

For more complex filtering - consult with: http://tldp.org/LDP/abs/html/.
I've also put my Bourne scripts here:
http://sourceforge.net/projects/ais/files/old/ais-scripts/
(it shows most of the techniques, I use it for reference, and it demonstrated getopt from the scripts as well we includes and scripts imports)
  • Add to Memories

(no subject)
[info]siberean




Modified Sun's java Properties - to support UTF-8 (internationalized labels etc)
[info]siberean
Java Properties gives a convenient and fast way to get any structured information from files. It is a natural way to structure information using named pairs, including hierarchies (i.e. any information can be presented in a human readable way). Unfortunately, the original Sun Properties specification didn't support Unicode in properties files, so it was hard to keep internationalized labels in it. Here is the fix. Hopefully, other implementations do not have this problem and properties can be used for hierarchies presenting - instead of the worst format for configuration files - XML. Example of using properties file for storing nested structures:
#ALL TRACE DEBUG WARN INFO ERROR FATAL NONE
log4j.rootCategory=TRACE, stdout, FILE

#
# info messages are going into the screen (if we run console - we want at to see
# at least some basic messages such as starting..., finished.
#
log4j.appender.stdout.Threshold=TRACE
log4j.appender.stdout=org.apache.log4j.ConsoleAppender
log4j.appender.stdout.layout=org.apache.log4j.PatternLayout

#
# only errors are going into the file, so if the file is not empty - there are
# errors.
#
log4j.appender.FILE.Threshold=ERROR
log4j.appender.FILE=org.apache.log4j.RollingFileAppender
log4j.appender.FILE.File=system.log
log4j.appender.FILE.layout.ConversionPattern=%d %-5p [%c] %m%n
log4j.appender.FILE.MaxFileSize=100000KB
log4j.appender.FILE.layout=org.apache.log4j.PatternLayout
log4j.appender.FILE.MaxBackupIndex=3
log4j.appender.FILE.Append=true
and the file is readable and robust (try to remove some delimited. Now try to write the same file in xml, read it in a couple of XML viewers (which may deal with white-spaces as they like), try substitute something (automation by scripts) using sed, grep etc in xml and compare... Using unicode in properties - one can store anything and if the file is too big - easily partition it, use different approaches of incremental loading on demand etc.

/**
 * Copyright (C) 2004 SourcePortal Inc. All rights reserved.
 *
 * Licensed under the Apache License:
 * http://www.apache.org/licenses/LICENSE-2.0
 * you may not use this file except in compliance with the License.
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
package org.sp.is.util;

import java.util.Properties;
import java.util.Hashtable;
import java.util.Enumeration;
import java.util.Date;
import java.io.FileOutputStream;
import java.io.OutputStreamWriter;
import java.io.FileInputStream;
import java.io.PrintStream;
import java.io.BufferedReader;
import java.io.IOException;
import java.io.OutputStream;
import java.io.BufferedWriter;
import java.io.PrintWriter;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.io.DataInputStream;
import java.io.StringReader;
import java.io.BufferedInputStream;

/**
 * Properties class which can read properties from file, applet or CGI/servlet
 * and supports unicode (Sun's version does not)
 *
 * @since jdk1.0
 * @version $Id: PropertiesImpl.java,v 1.2 2005/01/01 22:30:30 sourcer Exp $
 */
public class PropertiesImpl extends UCProperties{

  /**
   * Loads parameters from static file by means of ResourceLoader
   */
  public void loadAsResource(String resource)throws Exception{

     mergeWith(string2Properties(resourceToString(resource)));
  }

  /**
   * Loads all properties from String object (deserialization)
   */
  public void loadFromString(String s){

     mergeWith(string2Properties(s));
  }

  public static Properties string2Properties(String s){

    Properties props=new Properties();
    ByteArrayInputStream bais=null;
    try{
      bais=new ByteArrayInputStream(s.getBytes());
      props.load(bais);
    }
    catch(Exception e){
      System.out.println("PropertiesImpl::string2Properties:"+e);
    }    
    finally{
      try{bais.close();}catch(Exception e){}
    }
    return props;
  }

  /**
   * Aux routine transforming properties into String
   */
  public static String props2String(Properties props){

    String result=null;
    ByteArrayOutputStream baos=null;
    try{
      baos=new ByteArrayOutputStream();
      props.save(baos, "");
      result=baos.toString();
    }
    catch(Exception e){
      System.out.println("PropertiesImpl::props2String:"+e);
    }
    finally{
      try{baos.close();}catch(Exception e){}
    }
    return result==null?"":result;
  }  

  /**
   * Save all properties into one String 
   */
  public String saveAsString(){

    String result=null;
    ByteArrayOutputStream baos=null;
    try{
      baos=new ByteArrayOutputStream();
      save(baos, "");
      result=baos.toString();
    }
    catch(Exception e){
      System.out.println("PropertiesImpl::serialize:"+e);
    }
    finally{
      try{baos.close();}catch(Exception e){}
    }
    return result==null?"":result;
  }

  /**
   * Loads all properties from String object (deserialization)
   */
  public static UCProperties string2UCProperties(String s){

    UCProperties props=new UCProperties();
    BufferedReader br=null;
    try{
      br=new BufferedReader(new StringReader(s));
      ((UCProperties)props).load(br);

    }
    catch(Exception e){
      System.out.println("PropertiesImpl::str2Properties:"+e);
    }
    finally{
      try{br.close();}catch(Exception e){}
    }
    return props;
  }

  /**
   * Saves properties as file
   */
  public void saveIntoFile(String fileName)throws Exception{

    FileOutputStream fos=new FileOutputStream(fileName);
    save(fos,"");
    fos.close();
  }

  /**
   * Loads parameters from properties file from local file system.
   */
  public void loadFromPath(String path) throws Exception{

    load(new BufferedInputStream(new FileInputStream(path)));
  }
  
  /**
   * For debugging purposes only
   */
   /*
  public String toString(){

      StringBuffer sb=new StringBuffer();

      for(Enumeration enum = keys(); enum.hasMoreElements() ;) {
          String aKey=(String)enum.nextElement();
          String aValue=getProperty(aKey);
          sb.append(aKey);
          sb.append('=');
          sb.append(aValue);
          if(enum.hasMoreElements())sb.append(" ");
          sb.append("\n");
      }
		  return sb.toString();
  }
*/

  /**
   * Application entry point -- for debugging only
   */    /*
  public static void main(String args[]) {

    try{
       PropertiesImpl p=new PropertiesImpl();
       p.load("system.properties");
       System.out.println(p.toString());

       Thread.sleep(100000);
    }catch (Throwable t) {
       System.out.println("uncaught exception: " + t);
       t.printStackTrace();
       try{Thread.sleep(50000);}catch(InterruptedException ie){}
    }
  }        */
  
/*
  public static void main(String args[]) {

    try{

       PropertiesImpl p=new PropertiesImpl();
       p.put("1","aaa"); p.put("2","bbb"); p.put("3","ccc");
       String s=p.save();
       System.out.println("Serialized:"+"["+s+"]");

       p.reload(s);
       System.out.println("Reloaded:"+"["+p.save()+"]");

       Thread.sleep(100000);
    }catch (Throwable t) {
       System.out.println("uncaught exception: " + t);
       t.printStackTrace();
       try{Thread.sleep(50000);}catch(InterruptedException ie){}
    }
  }
*/  

/*
  public void loadSafe(String fileName){

    InputStream in = null;
    try{
      in=new FileInputStream(fileName);
    }catch(FileNotFoundException e){
      ClassLoader loader=ClassLoader.getSystemClassLoader();
      in=loader.getResourceAsStream(fileName);
      if(in!=null){
      //if(LOG.isEnabledFor(Priority.INFO)) LOG.info ("Read properties from classpath");
      }
    }
    if(in!=null){
      try{
        load(in);
     }catch(IOException e){
       //if(LOG.isEnabledFor(Priority.ERROR))LOG.error("Error reading properties");
     }
    }else{
      //if(LOG.isEnabledFor(Priority.WARN))LOG.warn("properties cannot be found");
    }
  }
*/

   /**
    * Encodes properties for using as an applet parameter or when CR/LF
    * (default Sun's propertiers delimiters) are not appropriate
    */
    public static String encodeAP(String s){

        StringBuffer sb = new StringBuffer();

        int len = (s != null) ? s.length() : 0;
        for(int i=0; i ';'
           }
           else{                  //";"
              sb.append('\n');
              j++;
           }
        }
        else{ //regular           //any
           sb.append(buffer[j++]);
        }
      }
      //last symbol:
      if(buffer[j]==';')sb.append('\n');
      else sb.append(buffer[j]);

      return sb.toString();
    }

  /**
   * Convenience method for loading properties from file taking name and using
   * a default classloader (if opening file is not possible. Example - jar'ed 
   * properties file classes)
   */
  public String resourceToString(String res)throws Exception{

     InputStream is=null;

     ClassLoader cl=getClass().getClassLoader();
     if(cl != null){
        is=cl.getResourceAsStream(res);
     }
     else{
        //is=ClassLoader.getResourceAsStream(res);
        System.err.println("PropertiesImpl::getClassLoader() returns null! - can't load resources");
        return null;
     }

     StringBuffer sb = new StringBuffer();

     String line;
     DataInputStream dis=null;

     try{
        dis=new DataInputStream(new BufferedInputStream(is));
        line=dis.readLine();
        while(line != null) {
            sb.append(line);
            sb.append("\n");
            line=dis.readLine();
        }
     }
     catch(Exception e){
        //System.err.println("PropertiesImpl:"+e);
        throw e;
     }
     finally{
        if(dis!=null)try{dis.close(); }catch(IOException e){}
     }

     //System.out.println("ResourceLoader::"+sb.toString());
     String out = sb.toString();
     if(out.length()==0) return null;

     return out;
  }

 /**
  * Adds new properties without removing the old ones
  */
  public void mergeWith(Hashtable anotherProps){

     for(Enumeration enum = anotherProps.keys(); enum.hasMoreElements() ;) {
         String aKey=(String)enum.nextElement();
         Object aValue=anotherProps.get(aKey);
         put(aKey, aValue);
     }
  }
}

////////////////////////////////////////////////////////////////////////////////
//The following is Properties implementation class supporting
//Unicode (Sun's original java.util.Properties class is based on ByteStreams)

/**
 * Fixed Sun's class to support Unicode.
 */
class UCProperties extends Properties{

    public UCProperties() {
    }

    private static final String keyValueSeparators = "=: \t\r\n\f";
    private static final String strictKeyValueSeparators = "=:";
    private static final String specialSaveChars = "=: \t\r\n\f#!";
    private static final String whiteSpaceChars = " \t\r\n\f";


    public synchronized void load(BufferedReader in) throws IOException{

	    while (true) {
            // Get next line
            String line = in.readLine();
            if (line == null)
                return;

            if (line.length() > 0) {
                // Continue lines that end in slashes if they are not comments
                char firstChar = line.charAt(0);
                if ((firstChar != '#') && (firstChar != '!')) {
                    while (continueLine(line)) {
                        String nextLine = in.readLine();
                        if(nextLine == null)
                            nextLine = new String("");
                        String loppedLine = line.substring(0, line.length()-1);
                        // Advance beyond whitespace on new line
                        int startIndex=0;
                        for(startIndex=0; startIndex= 0) && (line.charAt(index--) == '\\'))
            slashCount++;
        return (slashCount % 2 == 1);
    }

    /*
     * Converts encoded \uxxxx to unicode chars
     * and changes special saved chars to their original forms
     */
    private String loadConvert (String theString) {
        char aChar;
        int len = theString.length();
        StringBuffer outBuffer = new StringBuffer(len);

        for(int x=0; x 0x007e)) {
                        outBuffer.append('\\');
                        outBuffer.append('u');
                        outBuffer.append(toHex((aChar >> 12) & 0xF));
                        outBuffer.append(toHex((aChar >>  8) & 0xF));
                        outBuffer.append(toHex((aChar >>  4) & 0xF));
                        outBuffer.append(toHex( aChar        & 0xF));
                    } else {
                        if (specialSaveChars.indexOf(aChar) != -1)
                            outBuffer.append('\\');
                        outBuffer.append(aChar);
                    }
            }
        }
        return outBuffer.toString();
    }


    public synchronized void save(OutputStream out, String header)  {
        try {
            store(out, header);
        } catch (IOException e) {
        }
    }

    public synchronized void store(OutputStream out, String header)
    throws IOException
    {
        BufferedWriter awriter;
        awriter = new BufferedWriter(new OutputStreamWriter(out, "8859_1"));
        if (header != null)
            writeln(awriter, "#" + header);
        writeln(awriter, "#" + new Date().toString());
        for (Enumeration e = keys(); e.hasMoreElements();) {
            String key = (String)e.nextElement();
            String val = (String)get(key);
            key = saveConvert(key, true);

	    /* No need to escape embedded and trailing spaces for value, hence
	     * pass false to flag.
	     */
            val = saveConvert(val, false);
            writeln(awriter, key + "=" + val);
        }
        awriter.flush();
    }

    private static void writeln(BufferedWriter bw, String s) throws IOException {
        bw.write(s);
        bw.newLine();
    }

    /**
     * Searches for the property with the specified key in this property list.
     * If the key is not found in this property list, the default property list,
     * and its defaults, recursively, are then checked. The method returns
     * null if the property is not found.
     *
     * @param   key   the property key.
     * @return  the value in this property list with the specified key value.
     * @see     #setProperty
     * @see     #defaults
     */
    public String getProperty(String key) {
	    return (String)super.get(key);
    }

    /**
     * Searches for the property with the specified key in this property list.
     * If the key is not found in this property list, the default property list,
     * and its defaults, recursively, are then checked. The method returns the
     * default value argument if the property is not found.
     *
     * @param   key            the hashtable key.
     * @param   defaultValue   a default value.
     *
     * @return  the value in this property list with the specified key value.
     * @see     #setProperty
     * @see     #defaults
     */
    public String getProperty(String key, String defaultValue) {
	String val = getProperty(key);
	return (val == null) ? defaultValue : val;
    }

    /**
     * Returns an enumeration of all the keys in this property list, including
     * the keys in the default property list.
     *
     * @return  an enumeration of all the keys in this property list, including
     *          the keys in the default property list.
     * @see     java.util.Enumeration
     * @see     java.util.Properties#defaults
     */
    public Enumeration propertyNames() {
	Hashtable h = new Hashtable();
	enumerate(h);
	return h.keys();
    }

    /**
     * Prints this property list out to the specified output stream.
     * This method is useful for debugging.
     *
     * @param   out   an output stream.
     */
    public void list(PrintStream out) {
	out.println("-- listing properties --");
	Hashtable h = new Hashtable();
	enumerate(h);
	for (Enumeration e = h.keys() ; e.hasMoreElements() ;) {
	    String key = (String)e.nextElement();
	    String val = (String)h.get(key);
	    if (val.length() > 40) {
                val = val.substring(0, 37) + "...";
	    }
	    out.println(key + "=" + val);
	}
    }

    /**
     * Prints this property list out to the specified output stream.
     * This method is useful for debugging.
     *
     * @param   out   an output stream.
     * @since   JDK1.1
     */
    /*
     * Rather than use an anonymous inner class to share common code, this
     * method is duplicated in order to ensure that a non-1.1 compiler can
     * compile this file.
     */
    public void list(PrintWriter out) {
	out.println("-- listing properties --");
	Hashtable h = new Hashtable();
	enumerate(h);
	for (Enumeration e = h.keys() ; e.hasMoreElements() ;) {
	    String key = (String)e.nextElement();
	    String val = (String)h.get(key);
	    if (val.length() > 40) {
		val = val.substring(0, 37) + "...";
	    }
	    out.println(key + "=" + val);
	}
    }

    /**
     * Enumerates all key/value pairs in the specified hastable.
     * @param h the hashtable
     */
    private synchronized void enumerate(Hashtable h) {

	for (Enumeration e = keys() ; e.hasMoreElements() ;) {
	    String key = (String)e.nextElement();
	    h.put(key, get(key));
	}
    }

    /**
     * Convert a nibble to a hex character
     * @param	nibble	the nibble to convert.
     */
    private static char toHex(int nibble) {
	return hexDigit[(nibble & 0xF)];
    }

    /** A table of hex digits */
    private static final char[] hexDigit = {
	'0','1','2','3','4','5','6','7','8','9','A','B','C','D','E','F'
    };
}  
//end of UCProperties class







  • Add to Memories

Universal Makefile - not to rewrite the targets (just drop files into the dir)
[info]siberean

#
# This is universal Makefile - without a need to add dependencies after
# adding of a new source file (dependencies are resolved automagically).
# This is for projects - when autotools/automake/autoconf is an overkill.
#
# It has a very basic unit test infrastructure (to test a function) - pass
# a symbol to the program (example: ./main mytest).
# Add the corresponding testing main under #ifdef - in the appropriate file
# and a line in this Makefile (taking ut1 as a template).
# This way - the main main() will be ignored and your main will be called.
# If we'll have many unit tests - we'll run them in a batch shell script.
#
#
# Copyright (C) 2001 Vasili Gavrilov
#
# This program is free software; you can redistribute it and/or
# modify it under the terms of the GNU General Public License
# as published by the Free Software Foundation; either version 2
# of the License, or any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program; if not, write to the Free Software
# Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA  02111-1307, USA.
#

SHELL=/bin/sh

BIN=your_project_name

OS=$(shell uname)
ifeq ($(findstring CYGWIN,$(OS)),CYGWIN)
	OS=Cygwin
endif
ifeq ($(findstring WINNT,$(OS)),WINNT)
	OS=Cygwin
endif

ifeq ($(OS), Cygwin)
	NOCYGWIN= -mno-cygwin
	#fix for recently introduced problem 
	SHELL=c:/cygwin/bin/bash.exe
endif

#consider all *.c as sources  
SOURCES.c := $(wildcard *.c)


CFLAGS= $(NOCYGWIN) -ansi -W -Wall
CC=gcc
OBJS=$(SOURCES.c:.c=.o)
LINK=gcc $(CFLAGS)
LFLAGS=-lm $()

debug : CFLAGS = $(NOCYGWIN) -ansi -W -Wall -g -Wundef
pedantic : CFLAGS = $(NOCYGWIN) -ansi -W -Wall -g -Wundef -Wstrict-prototypes -Wmissing-prototypes -Wmissing-declarations
release : CFLAGS = $(NOCYGWIN) -ansi -W -Wall -O2

.SUFFIXES:
.SUFFIXES: .d .o .h .c
.c.o: ; $(CC) $(CFLAGS) -MMD -c $*.c 

.PHONY: clean

%.d: %.c
	@set -e; rm -f $@; \
	$(CC) -M $(CFLAGS) $< > $@.$$$$; \
	sed 's,\($*\)\.o[ :]*,\1.o $@ : ,g' < $@.$$$$ > $@; \
	rm -f $@.$$$$

all release debug pedantic ut ut1: $(BIN)

$(BIN): $(OBJS)
	$(LINK) $(FLAGS) -o $(BIN) $(OBJS) $(LFLAGS)

clean:
	rm -f $(BIN) $(OBJS) *.d *~


include $(sources:.c=.d)



24.04.2011:

Recently I've also put a template project - for ANSI C projects which supposed to have a common for command line processing tools functionality (such as POSIX getopt, printing help etc): http://sourceforge.net/projects/convertctempl/

Based on a similar template (something like that I used in school, in 90s), I've made few processing/filtering tools, one of the recent is here: http://sourceforge.net/search/?q=ped2raw

Just change the name of the project, modify help and args accordingly and write necessary processing functionality in process(). Adding of more sources to the project does not require to add into the Makefile (all dependencies are calculated automatically).
Tags:
  • Add to Memories

Simple ANSI C list
[info]siberean
From my school years. Hashes and other structures may be found in other projects, but this list is not there, so I've put it as a memo here:



#include <string.h>
#include <stdlib.h>
#include <stdio.h>
#include <time.h>


struct list_node{
	void *data;
	struct list_node *next;
};

struct list{
	struct list_node *head;
	struct list_node *tail;
};


void my_list_add(struct list *const list, void *const data){

	struct list_node *new_node=(struct list_node *)malloc(sizeof(struct list_node));

	if(!data){
		fprintf(stderr, "__FILE__: Not adding null data");
		return;
	}

	if(new_node == NULL){
		printf("no memory");
		exit -1;
	}

	new_node->next = NULL;
	new_node->data = data;

	if((list->head == NULL)){ /* empty list */
		list->head = list->tail = new_node;
	}
	else if(list->head == list->tail){ /* single element in the list */
		list->head->next = new_node;
		list->tail = new_node;
	}
	else{ /* >1 element */
		list->tail->next = new_node;
		list->tail = new_node;
	}
}



void *my_list_get(const struct list *list, void* sample, int (*criteria)(void* sample, void* data)){

	struct list_node *cur;

	if(list->head == NULL)
		return NULL; /* empty list */

	cur=list->head;

	while(cur){
		if((*criteria)(sample, cur->data)==0){
			return cur->data;
		}
		cur = cur->next;
	}

	return NULL;
}

int cmp(void* sample, void* data){
	return &data-&sample;
}


void *my_list_get_first(const struct list *list){

	if(list->head == NULL)
		return NULL; /* empty list */

	return list->head->data;
}


void my_list_iterate(const struct list *list, void (*todo_impl)(void* data)){

	struct list_node *cur;

	if(list->head == NULL)
		return; /* empty list */

	cur=list->head;
	while(cur){
		(*todo_impl)(cur->data);
		cur = cur->next;
	}
}

int main(void){
	struct list list;
	
	
    	struct timeval start, middle, end;
    	time_t s,ms;
	int i;

    	gettimeofday(&start, NULL);

	for(i = 0; i<1000000; i++){
		my_list_add(&list, &i);
	}

    	gettimeofday(&middle, NULL);
    	s = middle.tv_sec - start.tv_sec;
    	ms = middle.tv_usec - start.tv_usec;
    	if(ms<0){ s--; ms+=1000000; }

	printf("records loaded in %d.%d\n", s, ms);
	for(i = 0; i<1000000; i++){
		//my_list_get(&list, &i, &cmp);
    	}

	gettimeofday(&end, NULL);
	s = end.tv_sec - start.tv_sec;
    	ms = end.tv_usec - start.tv_usec;
    	if(ms<0){ s--; ms+=1000000; }
	printf("records red in %d.%d\n", s, ms);

	return 0;
}
  • Add to Memories

Science nonfiction books and resources
[info]siberean
* Stephen Hawking "A Brief History of Time"

* Douglas Hofstadter "GEB (Gödel, Escher, Bach: An Eternal Golden Braid)"

* Douglas Hofstadter "I Am a Strange Loop"

* Neil Shubin "Your inner fish" <== introduction into how evolution is been proven, into anatomy, geology for those - who do not have a biological or medical background

* David Deutsch "The Fabric of Reality"

* "Planet Earth" series: http://topdocumentaryfilms.com/planet-earth-the-complete-bbc-series/ (on http://topdocumentaryfilms.com/ I've searched: "bbc", then found "Planet Earth")

* Valentin Turchin "The Phenomenon of Science" http://pespmc1.vub.ac.be/POSBOOK.html (in Russian: "кибернетический подход к эволюции" http://www.refal.ru/turchin/phenomenon/)

* Richard Dawkins "Selfish Gene" (in Russian: "Эгоистичный Ген" http://warrax.net/51/dawkins/cover_dawkins.html)

* Simon Singh and Edzard Ernst "Trick or Treatment" (acupuncture, homeopathy, herbal medicine, and chiropractic from scientific point of view)

* Richard Dawkins "The God Delusion" (in Russian: http://ulenspiegel.od.ua/?ID=4528)

* TED: http://www.ted.com/talks/tags , http://www.ted.com/talks/tags/id/77/page/2 etc (evolution, physics, superstrings theory, universe, biology in simple language from the creators themselves, we could not stop to see TED for a couple of months)

* David Deutsch "The Fabric of Reality" http://en.wikipedia.org/wiki/The_Fabric_of_Reality, there is also his talk on TED

* Raymond Kurzweil "The Singularity Is Near"

* Norman Doidge "The Brain That Changes Itself" <= neuroplasticity

* Dan Ariely "Predictably Irrational: The Hidden Forces That Shape Our Decisions", "The Upside of Irrationality" there are also his talks on TED (ex: "our buggy moral code")

* Sam Harris "The End of Faith"

* Jared Diamond "Guns, Germs, and Steel: The Fates of Human Societies"

* Sam Harris "The Moral Landscape"

* http://www.sciencedaily.com/ (there is "more top science stories" item in the menu which enumerates all the recent publications in "Nature", "Science" etc and explains in more details in simple language the content of the articles (more info than in "Abstract". There is always link to the original publication. I browse this site weekly to keep on the pulse of science).

* Used to listen TWIS for a year (http://www.twis.org/) but for me it was too popular, too many words, while the same news can be captured through http://www.sciencedaily.com/, so I've stopped to track TWIS at some point.

* Robert Axelrod "The Evolution of Cooperation" (some adapted short text is here: http://www.doyletics.com/art/teocart.htm, some adapted discussion also in Turchin's book http://www.refal.ru/turchin/phenomenon/)

* Russian nonfiction (popular science) portal: http://elementy.ru/

* VideoLectures http://videolectures.net/ <== indefinite source of information (most of the lectures/seminars are professional and this portal is for more professional audience than ted.com)

* Dawkins and Venter discussion (in ccm of ocean water - 1mln bacteria, 10mln viruses, in a public place - 10k different kinds of bacteria, 100k different kinds of viruses etc): http://www.youtube.com/watch?v=0lmq-cQsTDc&p=88A95DA76EC4ECD9&index=5&playnext=3

* Skeptic society http://www.skeptic.com/ emails lists of books, discussions and lectures weekly

* News of all kinds from statistical perspective: http://statspotting.com/

* Bart D. Ehrman "Jesus, Interrupted. Revealing the hidden contradictions in the Bible (and why we don't know about them)" -- from a researcher, PhD, Professor of Religious Studies, former believer-fundamentalist, graduate of a seminary Master of Divinity from a Christian institution of higher education and later, studying Bible became agnostic.

* Jeff Hawkins "On intelligence"

* Thomas Friedman "The world is flat" (globalization economy, how it works)

* Jeff Jarvis "What Would Google Do?" (21st century way of doing business)

* Nassim Taleb "Fooled by Randomness"

* Nassim Taleb "The Black swan"

* http://www.khanacademy.org/ 2700+ educational videos
(about creation of it - on TED: http://www.ted.com/talks/lang/en/salman_khan_let_s_use_video_to_reinvent_education.html)

* Joshua Foer "Moonwalking with Einstein" <-- ancient 5th century B.C. art of memory management (invented by Simonides)

* Daniel Kahneman "Thinking, fast and slow"

to be continued...
  • Add to Memories

Менагеры
[info]siberean
"Если истина может вызвать скандал, то лучше допустить скандал, чем отрицать истину"
Григорий Великий

Эпиграф:
"На селе священник был единственным грамотным. Без него не родишься, не умрёшь, ни свадьбу сыграть. Благополучие священника зависело от прихожан. Чем больше людей в селе, тем больше ему дохода. В каждой хате куча детишек и они ещё возможно яиц <в своей жизни> не ели, а батюшке десяток отдай. Через несколько дней поборы повторяются, не пропуская ни одного двора.
На пасху несут куличи святить - один оставляют в церкви. Крестьянин хранит несколько фунтов белой муки на праздник, все остальные дни питаясь чёрным, или серым хлебом. А поп на Пасху своих свиней первосортными куличами кормит. Пословица была: хороше живется котовi и поповi. И поп всегда сытый." <из "Семейных хроник", М.Т.>

А ещё бабка мне рассказывала (из крепких "подкулачников"): "а у попа ручки белые-белые. Никогда в земле не копавшиеся. Не почётно было в попы идти". Вспоминаю об этом когда вижу батюшек и дяконов рядом. И как дьяконы прислуживать умеют. Подобострастно так. И руки белые-белые, не натруженные.

Но не об этом здесь.
Обсуждаем менагеров.

Жили-были когда-то охотники. Видимо, жили примерно так, как и бушмены сегодня: выходили группой на охоту, и каждая хищная тварь (а хищников в какой-нибудь вюрмский период - были полные леса: леопарды, тигры, волки, гиены) могла напасть с каждой ветки. Любая рана - и гибель от септицемии. Даже ночью группа людей отгоняла хищников как построением частоколов (когда было на то время), так и киданием камней, о чём свидетельствуют кольца разбросанных вокруг стоянок мелких камней. По различным оценкам - в группах (которые насчитывали 10-100 человек) не хватало охотников, особенно для отгона групповых нападений стай гиен и собачьих от детей и женщин. Многие группы погибали, но те которые выживали - вырабатывали качества действовать командой, быть организованными и сплочёнными, отбирались высокие моральные качества. С изобретением механизмов лёгкого воспроизведения огня - стало легче. Но тогда возникла роль оставаться у огня и защищать женщин, или поддерживать огонь, в то время как охотники (равные между собой) охотятся и рискуют. Первые, остающиеся у огня, скорее всего были больные или раненые, слабые. Но тут возникают религии (Прометей - это отголосок тех языческих религий огня). Напомним, что это было ещё задолго до изобретения земледелия, и календарей. Итак, кто-то видимо не захотел возвращаться на рискованную тропу борьбы с хищником или тяжёлым мамонтом, могущим легко растоптать, и выбрал более лёгкий путь, придумав религию, и манипуляцию более занятыми делом охотниками, производителями пищи. И тем самым не только превратился из слабого, неценного, растерявшего квалификацию охотника в полноценного представителя общества, но и став более важным, чем главные производители. Менеджером. Вернее священнослужителем (а менеджером любой священнослужитель был по-определению до начала 20 века).
Теперь о менеджерах.

<вырезано цензурой>
  • Add to Memories

aisconvert
[info]siberean
On SourceForge:
http://sourceforge.net/projects/aisconvert/

HIR search:
http://hirs.snpology.com

PED2RAW filter (ANSI C)
http://sourceforge.net/projects/ped2raw/
http://www.softpedia.com/get/Multimedia/Graphic/Graphic-Others/ped2raw.shtml
There are PED2RAW and RAW2PED converters in aisconvert toolkit, written in java, so this is rather a deprecated code branch.

Simple ANSI C converting tools projects template (universal Makefile and common things like getopt etc):
http://sourceforge.net/projects/convertctempl/

NID:
http://sourceforge.net/projects/nid/
discussion:
http://www.linux.org.ru/forum/general/3799143

aisconfig (ANSI C):
http://sourceforge.net/projects/aisconfig/

jxlpoi:
http://sourceforge.net/projects/jxlpoi/

other projects:
http://sourceforge.net/projects/aissearch/
http://sourceforge.net/projects/ais/
(search sourceforge)
  • Add to Memories

Volga
[info]siberean
http://static.panoramio.com/photos/original/762412.jpg
http://static.panoramio.com/photos/original/18119271.jpg

http://static.panoramio.com/photos/original/23916900.jpg
http://static.panoramio.com/photos/original/13434804.jpg
http://static.panoramio.com/photos/original/13471218.jpg
http://static.panoramio.com/photos/original/15702721.jpg
http://commondatastorage.googleapis.com/static.panoramio.com/photos/original/28273412.jpg  <-- 4-deck ship
http://commondatastorage.googleapis.com/static.panoramio.com/photos/original/19424983.jpg <-- May
http://commondatastorage.googleapis.com/static.panoramio.com/photos/original/33366646.jpg <-- swallows
http://commondatastorage.googleapis.com/static.panoramio.com/photos/original/18153271.jpg <-- calm
http://commondatastorage.googleapis.com/static.panoramio.com/photos/original/18455108.jpg <-- break, winter
http://commondatastorage.googleapis.com/static.panoramio.com/photos/original/19073381.jpg <-- Volga surf
http://commondatastorage.googleapis.com/static.panoramio.com/photos/original/31977940.jpg <--channel
http://commondatastorage.googleapis.com/static.panoramio.com/photos/original/18455148.jpg <-- ice, winter
http://www.panoramio.com/photo/27553792 <--gulf

http://maps.google.com/maps?ll=55.748057,49.085197&spn=0,0.970917&t=h&z=10&lci=com.panoramio.all&layer=c&cbll=55.748057,49.085197&cbp=12,0,,0,5&photoid=po-15425261 <-- familiar sand

http://maps.google.com/maps?ll=55.669453,48.980827&spn=0,0.970917&t=h&z=10&lci=com.panoramio.all&layer=c&cbll=55.669453,48.980827&cbp=12,0,,0,5&photoid=po-6444051 <-- meteor (hydrofoil), used to go every 15 or so minutes

http://maps.google.com/maps?ll=55.504528,49.047089&spn=0,0.970917&t=h&z=10&lci=com.panoramio.all&layer=c&cbll=55.504528,49.047089&cbp=12,0,,0,5&photoid=po-16997887

http://maps.google.com/maps?ll=55.541453,49.047089&spn=0,0.970917&t=h&z=10&lci=com.panoramio.all&layer=c&cbll=55.541453,49.047089&cbp=12,0,,0,5&photoid=po-17626626 <--sunset

http://maps.google.com/maps?ll=55.478659,49.115925&spn=0,0.970917&t=h&z=10&lci=com.panoramio.all&layer=c&cbll=55.478659,49.115925&cbp=12,0,,0,5&photoid=po-18083096 <-- gulfs

http://maps.google.com/maps?ll=55.514053,49.081249&spn=0,0.970917&t=h&z=10&lci=com.panoramio.all&layer=c&cbll=55.514053,49.081249&cbp=12,0,,0,5&photoid=po-6425403 <-- blue gulf

Udmurtiya:
http://maps.google.com/maps?ll=55.748057,49.085197&spn=0,0.970917&t=h&z=10&lci=com.panoramio.all&layer=c&cbll=55.748057,49.085197&cbp=12,0,,0,5&photoid=po-15425261

http://maps.google.com/maps?ll=57.204921,51.626816&spn=0,0.970917&t=h&z=10&lci=com.panoramio.all&layer=c&cbll=57.204921,51.626816&cbp=12,0,,0,5&photoid=po-16193130 <-- Pumsi, suspension bridge

http://maps.google.com/maps?ll=57.17376,51.602955&spn=0,0.970917&t=h&z=10&lci=com.panoramio.all&layer=c&cbll=57.17376,51.602955&cbp=12,0,,0,5&photoid=po-11390167
Kilmez

http://maps.google.com/maps?ll=57.135121,51.55592&spn=0,0.970917&t=h&z=10&lci=com.panoramio.all&layer=c&cbll=57.135121,51.55592&cbp=12,0,,0,5&photoid=po-21332309  Kilmez, April

http://maps.google.com/maps?ll=57.1245,51.568108&spn=0,0.970917&t=h&z=10&lci=com.panoramio.all&layer=c&cbll=57.1245,51.568108&cbp=12,0,,0,5&photoid=po-16192722
  Kilmez, from sand Island

Photo-gallery:
http://www.udmurt.ru/en/Udmurtia%20the%20Beautiful/Photo%20Gallery.php
  • Add to Memories

Lucene links
[info]siberean
HapMap:

http://hapmap.ncbi.nlm.nih.gov/downloads/genotypes/00README.txt &lt;== format
http://hapmap.ncbi.nlm.nih.gov/downloads/genotypes/?N=D &lt;== data

Lucene and indexing:

CIA strategic investment into Lucene:
http://www.iqt.org/news-and-press/press-releases/2009/Lucid_Imagination_06-15-09.html

http://news.cnet.com/8301-13505_3-10288143-16.html

one of lectures about Lucene:
http://www.fosslc.org/drupal/node/473

Lucene Wikis in general and performance in particular:
http://wiki.apache.org/lucene-java/FrontPage?action=show&redirect=FrontPageEN
http://wiki.apache.org/lucene-java/ImproveIndexingSpeed
another discussion with numbers:
http://www.gossamer-threads.com/lists/lucene/java-user/57213

tutorial:
http://today.java.net/pub/a/today/2003/07/30/LuceneIntro.html

A wrapper around Lucene (for manual indexing and auto-indexing of file hierarchies):
http://sourceforge.net/projects/ais/

DataMining links:
http://wiki.apache.org/lucene-java/InformationRetrieval

Screenshots under Windows XP (Linux ones are on sourceforge):
siberean.wikispaces.com/AIS+screenshots
Tags:
  • Add to Memories

Open Letter to a public site, been developed on tax money, and supporting only one browser type.
[info]siberean
/Draft/

A government website been accessed by wide public - cannot be designed only for users of a single browser. This is unacceptable and inexcusable. A lot of people have Mac (with Safari browser) and Linux (with many different browsers supporting w3c standards) as well as Firefox under Windows. Open-Solaris is started to be installed on Toshiba laptops with a set of browsers. Also Google’s Chrome is gaining popularity and Opera is still in use, not speaking about tons of portable devices with their own browsers. It is estimated that before year 2020– portable browsers users will make more hits to the sites than desktop browsers and will dominate finally. Government services, been provided online, need to be accessible via a multiplicity of browsers and operating systems. This is one of accessibility requirements *) and there are other requirements – (including on the Federal level) in the processing.

Accessibility standards are described in w3c documents which is the main standard body for the Web documents/markup **) and even Microsoft is agreeing with the dominance of w3c as the standardization authority. Due to the recent accessibility requirements - all sites will soon have to be compliant with XHTML ‘Strict’ markup standard. It is unacceptable to support one browser according to the requirements and it doesn't matter - whether it's Internet Explorer, Firefox, Chrome, Safari, Opera or Konqueror. It is also a technical mistake (more than one actually) - to claim that only IE is the standard browser while other browsers are "non-standard" ones.

Finally, it is unprofessional – to support one type of browser in 2008 (I remember – many of specialized in the Web design and programming companies - even during the second half of 90s - had to support multiple browsers, while it was much harder to achieve that goal at that time than it is today, since today we have much more standardized by w3c situation as well as multiple frameworks). In fact, JavaScript didn’t change much since 1999, and CSS are pretty standardized today across the browsers (actually most of the web-frameworks are already giving complete abstraction from the browsers differences).

Professional courses today are only teaching: "Forget about IE, never develop under IE – it is not strictly standard, use ‘Strict’ level, use only Firefox for developing - for compliance with standards" etc etc). Web developers today cannot live without such tools as Firebug, Venkman, CSS editors which are Firefox add-ons.
Please, do not compromise government spending on IT, proficiency of the government technical staff, while most of the private companies are supporting most of the browsers. It is the government sites - that should be examples of standard, examples of accessibility!

The last argument of not-so-proficient web-designers who are developing only for IE - is their sites statistics. But it was shown multiple times that the real statistics is very different (if the site is not accessible by Mac Safari browser – people frequently are not making subsequent hits, not using the site, not even entering it, while all the hits are done by those who succeeded. Or people are forced to boot up a machine which has an OS supporting IE - just for entering the site). Also many browsers are pretending to be IE – for accessing IE-only sites. Real statistics is far from 90% of IE in 2008 (in some European countries – Firefox gains up to 40%, while in Canada it is also only growing). Please, do not ignore other people’s needs and make the public sites (been developed by public/tax money) accessible to all people (not only those who has spent their money buying MS versions of OS but also who has bought Mac OS or who uses Linux, FreeBSD etc)!

December 16, 2008

*) those that are readable by reading scanners - for blind people, which means using UL/LI instead of tables, alternative ways to AJAX such as fallback to regular A HREF (unobtrusive JavaScript), independence of page’s readability from colors etc</span>**)

The Treasury Board of Canada, under its "Common Look and Feel Standards and Guidelines for the Internet", has adopted the Web Content Accessibility Guidelines 1.0 Priorities 1 and 2 checkpoints. The standards also address additional accessibility issues not covered by the Web Accessibility Initiative. Federal institutions listed in Schedule 1, 1.1 and 2 of the Federal Administration Act have to comply with the Treasury Board's standards by December 31, 2002. For those unfamiliar with accessibility issues pertaining to Web page design, consider that many users may be operating in contexts very different from your own:
* They may not be able to see, hear, move, or may not be able to process some types of information easily or at all.
* They may have difficulty reading or comprehending text.
* They may not have or be able to use a keyboard or mouse.
* They may have a text-only screen, a small screen, or a slow Internet connection.
* They may not speak or understand fluently the language in which the document is written.
* They may be in a situation where their eyes, ears, or hands are busy or interfered with (e.g., driving to work, working in a loud environment, etc.).
* They may have an early version of a browser, a different browser entirely, a voice browser, or a different operating system.
  • Add to Memories

Formulas parsers
[info]siberean
#JavaScript implementation of the Dejkstra'a shanting yard:
http://ewbi.blogs.com/develops/2004/12/excel_formula_p.html


#excel internal format BIFF:
http://sc.openoffice.org/excelfileformat.pdf
#formula operators enumerated:
http://chacocanyon.com/smm/readings/referenceoperators.shtml


#theory of formula parsing:
http://www.arcanearcade.com/Flex/FlexFormula/Evaluating%20Math%20-%20Robert%20Stehwien.pdf


#evaluation of all formulas in the spreadheet using POI
http://poi.apache.org/hssf/eval.html

#formats
http://mail-archives.apache.org/mod_mbox/poi-user/200501.mbox/%3C1107167932.9585.13.camel@it-avik.in.itellix.net%3E
  • Add to Memories

Associative Indexing Service
[info]siberean
AIS - Associative Indexing Service, an application for storing bookmarks, memos, indexing of big (lifetime) archives for fast future access to the data by (personalized) keywords. In other words - it is an extension of human associative memory :)

sourceforge.net/projects/ais/

siberean.wikispaces.com/AIS+screenshots

http://freshmeat.net/projects/ais-associative-indexing-service

  • Add to Memories

Grouping and storing personal information in Internet as in a global 'Cloud'
[info]siberean

Status of This Memo
 This document proposes a method to use public Internet as a global infinite
 storage for personal information and a way to group and separate a public
 personal information from other information.
 
 
Copyright Notice
 Copyright (C) The Internet Society (2005).

         This document and translations of it may be copied and
         furnished to others, and derivative works that comment on or
         otherwise explain it or assist in its implmentation may be
         prepared, copied, published and distributed, in whole or in
         part, without restriction of any kind, provided that the above
         copyright notice and this paragraph are included on all such
         copies and derivative works.  However, this document itself may
         not be modified in any way, such as by removing the copyright
         notice or references to the Internet Society or other Internet
         organizations, except as needed for the  purpose of developing
         Internet standards in which case the procedures for copyrights
         defined in the Internet Standards process must be followed, or
         as required to translate it into languages other than English.

         The limited permissions granted above are perpetual and will
         not be revoked by the Internet Society or its successors or
         assigns.

         This document and the information contained herein is provided
         on an "AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET
         ENGINEERING TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR
         IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE
         OF THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY
         IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A
         PARTICULAR PURPOSE."

 

 

 Grouping and storing personal information in Internet as in a global 'Cloud'                

V.Gavrilov                               June 2009

 
 The Internet today (so called WEB2) is more and more used for decentralized
 storing and editing of articles, comments, blogs, bulletins, wikis which are
 hosted on different public sites. But there
 is no a standard way of extracting all the information published by a user,
 grouping and searching of such information, an easy way of separating
 ones personalized information from the miriads of published articles by
 other users.
 This grouping exists only in editor's head as personal associations
 and memory. With time passed by - some publishings are lost, URLs and
 domains may change and boookmarks are expiring, so the only way to find
 one's published information is to search the Internet again and ... try
 to find one's information by hand - from millions of results returned by a
 search engine, narrowing requests and searches.
 
 Proposed here is a simple method of grouping personal information by attaching
 of a signature to every published message where the signature is a short
 result from a one-way hash-function generated from a combination of a user's
 name, date of birth and a personalized message - to avoid collisions.
 To avoid confusion with the term "digital signature" from asymmetric
 cryptography - let's from now on name our signature/hash as NID (Network ID).
 
 
 A scenario could look like the following: while browsers still not supporting
 NID - a user could temporarily copy/paste NID during every personal post.
 In the future - user will be able to query a favourite search engine for a NID and will get
 only his personal stuff - been separated from the rest of Internet information
 because NID forms a word generated from a personal unique information which
 will highly unlikely occur somewhere else.
 In future - browsers or search engine could support a checkbox: "Personal/Public"
 and selecting of this checkbox - will allow to extract only the personal
 information from the Internet, the user ever published (using this NID).
 This is a simplest scenario. More complex usages of different groupings of
 personal information (forming multiple groups inside one personalized group by
 means of attached to the NID keywords) - may be easily shown.
 
 Let's consider a simple implementation of such NID generator.
 It is obvious that the compactness of NID is highly desired. And it seems
 that for such purpose - even 128-bit MD5 algorithm can be used successfully.
 (Actually, the published vulnerabilities of MD5 on collisions will not have
 impact on this particular usage due to the predefined and short string, from
 which the key is generated and secondly - even if a collision will occur - it will
 not have a big impact, so will not be such critical). From another side -
 22-char string hash result (base64 encoded 128-bit binary output from MD5) is
 short enough - to be stored extensively in the Internet in every post or article.
 
 Generating of a personalized NID may be hypotetically demonstrated by the following
 UNIX command * (it will be actually longer that 22-char since the standard built-in
 md5 uses less compact than base64 encoding: HEX):
 
 $ echo "Vasili Gavrilov 01011968 my cat's name - Kuzma" | md5
 $ XQgYnkgYSBwZXJzZXZlcm

 where "my cat's name - Kuzma" is a seed/salt, added for avoiding collisions
 of multiple persons having the same name and were born on the same date and also -
 to avoid generating of the NID by another person - to extract somebody else's
 information (if the name, DOB and protocol of NID generating are known).


 What should be noted here is that there should be at least minimal protocol of
 what fields are to be used and in which sequence - for generating NID - to avoid
 collisions by using of too simple feeds into md5+base64 combination.
 This RFC is intended for begining of discussing of this convention.
 
 For example - the protocol could require to write first name, last name,
 Date of Birth and "salt" - in this order (in any case, with any delimiter or
 vice versa - with predefined delimiter and casing - TBD. Benefits of that?).
 
 
 An extension of the protocol could be an attaching of a personal keyword or
 an association - to the NID.
 For example, when storing something connected with photo - the user could attach
 "photo" at the end of the NID:
  XQgYnkgYSBwZXJzZXZlcmphoto
  
 and in future - searching the Internet for this string will give a user all
 the entries ever stored with this key.
 Storing of multiple keys NIDs with multiple keywords will allow to create
 arbitrary groups and search engines will do their regular job for intersecting
 of the groups.
 
 What should be noted here is that it will be hard for another person - to get
 somebody else's information due to irreversibility of the hash function and
 existence of the 'salt' acting as a 'password'.
 No one is restricted to use more than one NID, so this is very different
 from assigning of NID to every user forever and so - this seems to be very
 privacy-friendly approach also.
 
 In future - browsers (or search engines) may support transparent appending of NID
 to the requests and searching for the past personal postings connected with "photo" and
 "vacation" will be able to achieve by just entry in a browser:
 "photo vacation" - as it is done currently against common data.
 Since above-mentioned checkbox "Personal" will be checked-in - a browser will attach
 locally saved (or saved in a Cookie or a session) NID and will send a more restricted
 request returning only the user personal postings (from multiple sites) containing
 both keywords "photo" and "vacation".
 
 We could imagine other extensions such as attaching of a counter or
 another id at the end - to allow saving of redundant (the same) data into
 multiple sites and for easier distinguishing of the duplicated data in the browser.
 This can be further elaborated.
 
 
 The above-mentioned procedure allows to use public internet as an infinite storage
 of personal data and easy extraction and grouping of such data and separating
 of public data and data saved by a person. This also transforms saving of the data
 into the Internet into saving into one global 'Cloud' and abstracts the location
 (URL may change but the information will still remain searchable).
 This will allow the personal data to be distributed equally on multiple public
 storages and in future - possibly to organize personal distributed services, working
 with personal data in really parallel way.
 
 

XQgYnkSBwZXZXZlcm

*)  A reference tool for generating such signature is here: http://sourceforge.net/projects/nid/
  • Add to Memories

Experimenting with ARM-based SheevaPlug
[info]siberean
Connected the device to the power line and RJ45 - to the router.
New IP appeared on the network and was able to connect through ssh using root/nosoup4u.

Measuring the time from powering on until network is up (on first response I'm killing ping process):

$ time ping 192.168.0.104
PING 192.168.0.104 (192.168.0.104) 56(84) bytes of data.
From 192.168.0.101 icmp_seq=90 Destination Host Unreachable
From 192.168.0.101 icmp_seq=91 Destination Host Unreachable
. . .
64 bytes from 192.168.0.104: icmp_seq=92 ttl=64 time=4.33 ms
^C
--- 192.168.0.104 ping statistics ---
92 packets transmitted, 1 received, +60 errors, 98% packet loss, time 91094ms
rtt min/avg/max/mdev = 4.338/4.338/4.338/0.000 ms, pipe 3

real    1m31.997s
user    0m0.000s
sys    0m0.004s
user@anode:~$

So, the device boots in ~1.5 minutes (I've measured twice - second time it was 1m29s)


root@debian:/mnt/tmp# cat /proc/cpuinfo
Processor    : ARM926EJ-S rev 1 (v5l)
BogoMIPS    : 1192.75
Features    : swp half thumb fastmult edsp
CPU implementer    : 0x56
CPU architecture: 5TE
CPU variant    : 0x2
CPU part    : 0x131
CPU revision    : 1
Cache type    : write-back
Cache clean    : cp15 c7 ops
Cache lockdown    : format C
Cache format    : Harvard
I size        : 16384
I assoc        : 4
I line length    : 32
I sets        : 128
D size        : 16384
D assoc        : 4
D line length    : 32
D sets        : 128

Hardware    : Feroceon-KW
Revision    : 0000
Serial        : 0000000000000000


root@debian:/mnt/tmp# free
             total       used       free     shared    buffers     cached
Mem:        515636     100112     415524          0          0      80384
-/+ buffers/cache:      19728     495908
Swap:            0          0          0

root@debian:~# dmesg
Linux version 2.6.22.18 (dhaval@devbox) (gcc version 4.2.1) #1 Thu Mar 19 14:46:22 IST 2009
CPU: ARM926EJ-S [56251311] revision 1 (ARMv5TE), cr=00053177
Machine: Feroceon-KW
Using UBoot passing parameters structure
Memory policy: ECC disabled, Data cache writeback
On node 0 totalpages: 131072
  DMA zone: 1024 pages used for memmap
  DMA zone: 0 pages reserved
  DMA zone: 130048 pages, LIFO batch:31
  Normal zone: 0 pages used for memmap
CPU0: D VIVT write-back cache
CPU0: I cache: 16384 bytes, associativity 4, 32 byte lines, 128 sets
CPU0: D cache: 16384 bytes, associativity 4, 32 byte lines, 128 sets
Built 1 zonelists.  Total pages: 130048
Kernel command line: console=ttyS0,115200 mtdparts=nand_mtd:0x400000@0x100000(uImage),0x1fb00000@0x500000(rootfs) rw root=/dev/mtdblock1 rw ip=10.4.50.4:10.4.50.5:10.4.50.5:255.255.255.0:DB88FXX81:eth0:none
PID hash table entries: 2048 (order: 11, 8192 bytes)
Console: colour dummy device 80x30
Dentry cache hash table entries: 65536 (order: 6, 262144 bytes)
Inode-cache hash table entries: 32768 (order: 5, 131072 bytes)
Memory: 256MB 256MB 0MB 0MB = 512MB total
Memory: 515456KB available (3864K code, 257K data, 104K init)
Calibrating delay loop... 1192.75 BogoMIPS (lpj=5963776)
Mount-cache hash table entries: 512
CPU: Testing write buffer coherency: ok
NET: Registered protocol family 16

CPU Interface
-------------
SDRAM_CS0 ....base 00000000, size 256MB
SDRAM_CS1 ....base 10000000, size 256MB
SDRAM_CS2 ....disable
SDRAM_CS3 ....disable
PEX0_MEM ....base e8000000, size 128MB
PEX0_IO ....base f2000000, size   1MB
INTER_REGS ....base f1000000, size   1MB
NFLASH_CS ....base fa000000, size   2MB
SPI_CS ....base f4000000, size  16MB
BOOT_ROM_CS ....no such
DEV_BOOTCS ....no such
CRYPT_ENG ....base f0000000, size   2MB

  Marvell Development Board (LSP Version KW_LSP_4.2.7_patch2)-- SHEEVA PLUG  Soc: 88F6281 A0 LE

 Detected Tclk 200000000 and SysClk 400000000
MV Buttons Device Load
Marvell USB EHCI Host controller #0: c08b8600
PEX0 interface detected no Link.
PCI: bus0: Fast back to back transfers enabled
SCSI subsystem initialized
usbcore: registered new interface driver usbfs
usbcore: registered new interface driver hub
usbcore: registered new device driver usb
NET: Registered protocol family 2
Time: kw_clocksource clocksource has been installed.
IP route cache hash table entries: 16384 (order: 4, 65536 bytes)
TCP established hash table entries: 65536 (order: 7, 524288 bytes)
TCP bind hash table entries: 65536 (order: 6, 262144 bytes)
TCP: Hash tables configured (established 65536 bind 65536)
TCP reno registered
RTC registered
Use the XOR engines (acceleration) for enhancing the following functions:
  o RAID 5 Xor calculation
  o kernel memcpy
  o kenrel memzero
Number of XOR engines to use: 4
cesadev_init(c000c894)
mvCesaInit: sessions=640, queue=64, pSram=f0000000
Warning: TS unit is powered off.
MV Buttons Driver Load
NTFS driver 2.1.28 [Flags: R/O].
JFFS2 version 2.2. (NAND) © 2001-2006 Red Hat, Inc.
io scheduler noop registered
io scheduler anticipatory registered (default)
Serial: 8250/16550 driver $Revision: 1.90 $ 4 ports, IRQ sharing disabled
serial8250.0: ttyS0 at MMIO 0xf1012000 (irq = 33) is a 16550A
serial8250.0: ttyS1 at MMIO 0xf1012100 (irq = 34) is a 16550A
Loading Marvell Ethernet Driver:
  o Cached descriptors in DRAM
  o DRAM SW cache-coherency
  o Single RX Queue support - ETH_DEF_RXQ=0
  o Single TX Queue support - ETH_DEF_TXQ=0
  o TCP segmentation offload enabled
  o Receive checksum offload enabled
  o Transmit checksum offload enabled
  o Network Fast Processing (Routing) supported
  o Driver ERROR statistics enabled
  o Driver INFO statistics enabled
  o Proc tool API enabled
  o Rx descripors: q0=128
  o Tx descripors: q0=532
  o Loading network interface(s):
    o eth0, ifindex = 1, GbE port = 0
    o eth1, ifindex = 2, GbE port = 1

mvFpRuleDb (dfd00000): 16384 entries, 65536 bytes
Intel(R) PRO/1000 Network Driver - version 7.3.20-k2-NAPI
Copyright (c) 1999-2006 Intel Corporation.
e100: Intel(R) PRO/100 Network Driver, 3.5.17-k4-NAPI
e100: Copyright(c) 1999-2006 Intel Corporation

Warning Sata is Powered Off
NFTL driver: nftlcore.c $Revision: 1.98 $, nftlmount.c $Revision: 1.41 $
NAND device: Manufacturer ID: 0xad, Chip ID: 0xdc (Hynix NAND 512MiB 3,3V 8-bit)
Scanning device for bad blocks
Bad eraseblock 324 at 0x02880000
Bad eraseblock 332 at 0x02980000
Bad eraseblock 340 at 0x02a80000
Bad eraseblock 348 at 0x02b80000
Bad eraseblock 356 at 0x02c80000
Bad eraseblock 364 at 0x02d80000
Bad eraseblock 372 at 0x02e80000
Bad eraseblock 380 at 0x02f80000
Bad eraseblock 2372 at 0x12880000
Bad eraseblock 2380 at 0x12980000
Bad eraseblock 2388 at 0x12a80000
Bad eraseblock 2396 at 0x12b80000
Bad eraseblock 2404 at 0x12c80000
Bad eraseblock 2412 at 0x12d80000
Bad eraseblock 2420 at 0x12e80000
Bad eraseblock 2428 at 0x12f80000
Bad eraseblock 3088 at 0x18200000
Bad eraseblock 3636 at 0x1c680000
Bad eraseblock 3637 at 0x1c6a0000
Bad eraseblock 3644 at 0x1c780000
Bad eraseblock 3645 at 0x1c7a0000
Bad eraseblock 3646 at 0x1c7c0000
Bad eraseblock 3647 at 0x1c7e0000
Bad eraseblock 3648 at 0x1c800000
Bad eraseblock 3684 at 0x1cc80000
2 cmdlinepart partitions found on MTD device nand_mtd
Using command line partition definition
Creating 2 MTD partitions on "nand_mtd":
0x00100000-0x00500000 : "uImage"
0x00500000-0x20000000 : "rootfs"
ehci_marvell ehci_marvell.70059: Marvell Orion EHCI
ehci_marvell ehci_marvell.70059: new USB bus registered, assigned bus number 1
ehci_marvell ehci_marvell.70059: irq 19, io base 0xf1050100
ehci_marvell ehci_marvell.70059: USB 2.0 started, EHCI 1.00, driver 10 Dec 2004
usb usb1: configuration #1 chosen from 1 choice
hub 1-0:1.0: USB hub found
hub 1-0:1.0: 1 port detected
ohci_hcd: 2006 August 04 USB 1.1 'Open' Host Controller (OHCI) Driver
USB Universal Host Controller Interface driver v3.0
usbcore: registered new interface driver usblp
drivers/usb/class/usblp.c: v0.13: USB Printer Device Class driver
Initializing USB Mass Storage driver...
usbcore: registered new interface driver usb-storage
USB Mass Storage support registered.
mice: PS/2 mouse device common for all mice
i2c /dev entries driver
Linux telephony interface: v1.00
Marvell Telephony Driver:
mvBoardVoiceAssembleModeGet: TDM not supported(boardId=0x9)
assembly=-1,irq=-1
mp_check_config: Error, invalid voice assembley mode
md: linear personality registered for level -1
md: raid0 personality registered for level 0
md: raid1 personality registered for level 1
raid6: int32x1     97 MB/s
raid6: int32x2    114 MB/s
raid6: int32x4    122 MB/s
raid6: int32x8    110 MB/s
raid6: using algorithm int32x4 (122 MB/s)
md: raid6 personality registered for level 6
md: raid5 personality registered for level 5
md: raid4 personality registered for level 4
raid5: measuring checksumming speed
   arm4regs  :  1071.600 MB/sec
   8regs     :   754.800 MB/sec
   32regs    :   899.600 MB/sec
raid5: using function: arm4regs (1071.600 MB/sec)
device-mapper: ioctl: 4.11.0-ioctl (2006-10-12) initialised: dm-devel@redhat.com
dm_crypt using the OCF package.
sdhci: Secure Digital Host Controller Interface driver
sdhci: Copyright(c) Pierre Ossman
mvsdmmc: irq =28 start f1090000
mvsdmmc: no IRQ detect
usbcore: registered new interface driver usbhid
drivers/hid/usbhid/hid-core.c: v2.6:USB HID core driver
Advanced Linux Sound Architecture Driver Version 1.0.14 (Thu May 31 09:03:25 2007 UTC).
mvCLAudioCodecRegGet: Error while reading register!
mvCLAudioCodecInit: Error - Invalid Cirrus Logic chip/rev ID!
Error - Cannot initialize audio decoder.at address =0xff<6>ALSA device list:
  #0: Marvell mv88fx_snd ALSA driver
TCP cubic registered
NET: Registered protocol family 1
NET: Registered protocol family 17
eth0: started
IP-Config: Complete:
      device=eth0, addr=10.4.50.4, mask=255.255.255.0, gw=10.4.50.5,
     host=DB88FXX81, domain=, nis-domain=(none),
     bootserver=10.4.50.5, rootserver=10.4.50.5, rootpath=
md: Autodetecting RAID arrays.
md: autorun ...
md: ... autorun DONE.
eth0: link up, full duplex, speed 100 Mbps
eth0: link down
eth0: link up, full duplex, speed 100 Mbps
Empty flash at 0x088dd338 ends at 0x088dd800
VFS: Mounted root (jffs2 filesystem).
Freeing init memory: 104K
fat: exports duplicate symbol fat_add_entries (owned by kernel)

root@debian:~# who am i
root     pts/0        2000-01-24 21:57 (192.168.0.101)
root@debian:~#


Now trying to connect through debug console (mini USB port) - through minicom.

on the host Linux machine:
$ modprobe ftdi_sio vendor=0x9e88 product=0x9e8f

for control:
$ lsmod | grep ftdi
ftdi_sio               55944  1
usbserial              39528  3 ftdi_sio
usbcore               149488  6 ftdi_sio,usbserial,ohci_hcd,ehci_hcd,uhci_hcd

As said in the manual:
$minicom -s
but instead of /dev/ttyS0 - setting to /dev/ttyUSB1

$minicom -e etc


after that I've got login and entered root/nosoup4u:


$ minicom -s

Welcome to minicom 2.3                                                   
                                                                         
OPTIONS: I18n                                                               
Compiled on Oct 24 2008, 06:37:44.                                          
Port /dev/ttyUSB1                                                           
                                                                            
                 Press CTRL-A Z for help on special keys                    
                                                                            
                                                                            
Ubuntu jaunty (development branch) debian ttyS0                             
                                                                            
debian login: AT S7=45 S0=0 L1 V1 X4 &c1 E1 Q0                              
Password:
Last login: Mon Jan 24 18:52:17 UTC 2000 from 192.168.0.101 on pts/0           
Linux debian 2.6.22.18 #1 Thu Mar 19 14:46:22 IST 2009 armv5tejl               
                                                                               
The programs included with the Ubuntu system are free software;                
the exact distribution terms for each program are described in the             
individual files in /usr/share/doc/*/copyright.                                
                                                                               
Ubuntu comes with ABSOLUTELY NO WARRANTY, to the extent permitted by           
applicable law.                                                                
                                                                               
To access official Ubuntu documentation, please visit:                         
http://help.ubuntu.com/                                                        
1 failure since last login.                                                    
Last was Mon Jan 24 21:32:17 2000 on ttyS0.                                    

root@debian:~# who am i                                                        
root     ttyS0        Jan 24 21:33                                             

root@debian:~# uname -a                                                        
Linux debian 2.6.22.18 #1 Thu Mar 19 14:46:22 IST 2009 armv5tejl GNU/Linux    

root@debian:~# apt-get update
E: Archive directory /var/cache/apt/archives/partial is missing.

root@debian:~# mkdir -p /var/cache/apt/archives/partial

now it is fine:

root@debian:~# apt-get update
Get:1 http://ports.ubuntu.com jaunty Release.gpg [189B]
Get:2 http://ports.ubuntu.com jaunty/main Translation-en_CA [2731B]
Get:3 http://ports.ubuntu.com jaunty/restricted Translation-en_CA [3970B]
Ign http://ports.ubuntu.com jaunty/universe Translation-en_CA        
Ign http://ports.ubuntu.com jaunty/multiverse Translation-en_CA
Get:4 http://ports.ubuntu.com jaunty Release [74.6kB] 
Get:5 http://ports.ubuntu.com jaunty/main Packages [1234kB]
Get:6 http://ports.ubuntu.com jaunty/restricted Packages [865B]               
Get:7 http://ports.ubuntu.com jaunty/universe Packages [4442kB]               
Get:8 http://ports.ubuntu.com jaunty/multiverse Packages [159kB]              2
Fetched 5917kB in 60s (97.9kB/s)                                             
Reading package lists... Done
root@debian:~#

Trying to nfs mount main linux host:

root@debian:~# mount 192.168.0.101:/ /mnt/tmp/
mount: wrong fs type, bad option, bad superblock on 192.168.0.101:/,
       missing codepage or helper program, or other error
       (for several filesystems (e.g. nfs, cifs) you might
       need a /sbin/mount.<type> helper program)
       In some cases useful info is found in syslog - try
       dmesg | tail  or so

so nfs client is not installed

root@debian:~# apt-get install nfs-common portmap

root@debian:~# mount 192.168.0.101:/ /mnt/tmp/
root@debian:~# ls /mnt/tmp
bin   initrd.img      lib   mnt   root  sys  var
boot   dev     etc    initrd.img.old  lost+found  opt   sbin  tmp  vmlinuz
cdrom   home   media       proc  srv   usr  vmlinuz.old

(next time we'll do nfs boot)

A little bit NFS performance (/mnt/tmp is mounted Linux host):

From the device to Linux host:
root@debian:/mnt/tmp# dd if=/dev/zero of=/mnt/tmp/tmp/testfile bs=1024 count=100000
100000+0 records in
100000+0 records out
102400000 bytes (102 MB) copied, 9.22229 s, 11.1 MB/s

From Linux host to me:
root@debian:/mnt/tmp# dd if=/mnt/tmp/tmp/testfile of=/tmp/testfile bs=1024 count=100000
100000+0 records in
100000+0 records out
102400000 bytes (102 MB) copied, 44.1235 s, 2.3 MB/s
root@debian:/mnt/tmp#

which is obviously limited by slow flash memory:
from local to local:
root@debian:/mnt/tmp# dd if=/dev/zero of=/tmp/testfile bs=1024 count=100000
100000+0 records in
100000+0 records out
102400000 bytes (102 MB) copied, 37.7241 s, 2.7 MB/s

now from remote to remote:
root@debian:/mnt/tmp# dd if=/mnt/tmp/tmp/testfile of=/mnt/tmp/tmp/testfile2 bs=1024 count=100000
100000+0 records in
100000+0 records out
102400000 bytes (102 MB) copied, 9.46773 s, 10.8 MB/s

which is pretty good.


Now - let' employ the device as rsynching master host, making incremental synchronization of data between old Pentium MMX and Pentium 2 machines (before I used Linux host, consuming a lot of energy)
root@debian:~# apt-get install rsync
. . .
root@debian:~# scp 192.168.0.12:/ISROOT/1/backup.sh
root@debian:~# mkdir ISROOT

...have some problem with mapping between users (write permission)

root@debian:~# apt-get install openjdk-6-jdk

root@debian:~# apt-get install xorg
root@debian:~# apt-get install xterm
root@debian:~# apt-get install fluxbox
root@debian:~# apt-get install vnc-server

dpkg-reconfigure xserver-xorg

respond "Yes" to framebuffer, all other are default.


bonnie on my P4/1400:

Version 1.03c       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
anode         1504M  5677  51 25452  31 17245  19  9652  56 49905  20 292.1   5
                    ------Sequential Create------ --------Random Create--------
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16  4147  32 32462  22  6747  31  5210  38 +++++ +++ 10965  46
anode,1504M,5677,51,25452,31,17245,19,96
52,56,49905,20,292.1,5,16,4147,32,32462,22,6747,31,5210,38,+++++,+++,10965,46


bonnie on ARM (SheevaPlug):
Version 1.03c       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
debian           1G  3152  99  8606  80  6942  89  4999  99 28059  99  1490  96
                    ------Sequential Create------ --------Random Create--------
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16   562  59 +++++ +++   937  99   864  94 +++++ +++   879 100
debian,1G,3152,99,8606,80,6942,89,4999,99,28059,99,1490.4,96,16,562,59,+++++,+++,937,99,864,94,+++++,+++,879,100



Default SheevaPlug provides Samba (I've just started it)

user@anode:~$ smbtree
Password:
WORKGROUP
    \\DEBIAN                 debian server (Samba, Ubuntu)
        \\DEBIAN\IPC$               IPC Service (debian server (Samba, Ubuntu))
        \\DEBIAN\Media              Media

  • Add to Memories

REST (as command-line of the web) vs web-services
[info]siberean
REST (Representational State Transfer) term was introduced by Roy Fielding in his doctoral dissertation in 2000 and frequently used in a constrast with so-called web-services, SOAP, messages been enveloped into xml and been sent over the same HTTP.
However, the "command line of the web" style (how I name it) have been in use since WWW introduction or HTTP/1.0, with the state machine implemented on the server-side or/and the client-side, since the begining, using CGI then.

And similar to the command-line vs GUI discussion - the command line always wins as the most compact, flexible, efficient, less-complicated and expressible way of passing information from one system to another, been a developer bottom-up approach, efficiently decoupling the systems and not bound it with a particular GUI. If we can express anything in plain English language - the same is true for the command-line, which is a set of arguments (ordered, or keyed). It can be shown/proved by direct remapping. The command-line interface is the most compact (from the human-editable, trackable ones) and it has less technology-for-technology overcomplications.

Similar to the GUI (vs command-line) - SOAP or similar protocols puts additional technological level (and so - the complexity) on top of existing compact vocabulary of nouns and verbs of business logic. Frequently - the problems, SOAP programmers are struggling with, are not connected with the complexity of the business vocabulary or complex business logic but rather - are introduced by the technology itself.

Another analog of those problems introduced by developers are Queue representations (frameworks), where a simple queue (as a data structure) can be easily created (in a minute) as a database table, with appropriate flags (to mark a state), while programmers of Enterprise queues (there is even a speciality for this!) are struggling with all kinds of problems produced by the technolgy itself, not even thinking to use a simple table (in case of closed loops of Enterprise platforms - the explanation is obvious. Read my old article about "closed loops").
Queues, Enterprise vegetables and other creatures can be described in separate articles, here we are concentrating on web-services only.

Like databases have 4 basic operations: SELECT/INSERT/DELETE/UPDATE, however CRUD term is frequently used (CRUD is for: Create/Read/Update/Delete), the Web has it's own set of operations since the begining: GET/PUT/DELETE/POST, and those operations are at the core of the HTTP protocol.
MIME is another robust messaging protocol which is in use for more than 30 years (RFC #733 etc). Again we will not start another discussion about queues and messages and why Enterprise does not use proved by time MIME for all types of queueing (SMTP relays can be easily reconfigured).

So, HTTP has the most basic operations for remote manipulation of information and it is the most robust protocol for delivering named verbs and nouns through the wire through HTTP POST. And at the same time - it is comprehensive and everything may be encoded as a list of named pairs, binary blobs in multi-part form requests and/or using custom encoders - if necessary. The base protocol is pretty flexible and convenient.
Another part is the application's state-machine and it's complexity depends on the business logic complexity and should be stored somewhere anyway. It is better to have it in the simplest form, and not to introduce additional complexity and week points to the pipeline (for deelopment and maintanence perspective).

On the other side, WSDL introduces another set of specific operations on top of HTTP method and URI pair and is not designed from the begining to describe the abovementioned generic operations GET/PUT/DELETE/POST, existing in other areas of IT in the that same form (see DB example above). So, it is rather both ambiguous and artificially over-compicated - for such basic protocol as delivering messages. URI space is complex and URI (the core of the web) becomes a second citizen in web services world. Like somebody said: web-services are not "webby" anymore, similar to the way XML is not "texty" (for the text processing standard Unix tools, used in decades), i.e. they break some existing traditions of information processing.

To be continued...
  • Add to Memories

C hash vs C++ Map vs C++ Hash Map vs java HashMap
[info]siberean
Test demonstrating that spending a day programming a custom optimized hash (as opposed with blind using of the default STL Map or HashMap which appeared to be even slower) may be worth (Especially in projects where such operations are critical: in a rule-engine, expert-system, in-memory index etc). Sure, it is possible to play with C++ code, playing with the algorithm, making it C-like code, but the code will become even less obvious and longer than C (so why to bother at all?). C is simpler language (the whole definition of the language is fitting into K&R thin book) - much more compact than C++, so there are much less artifacts, additional things to know and so - less bugs. Not speaking about overheads of C++.


$ g++ -O2 -o map map.cpp
$ ./map
records loaded in 0.233988
records red in 0.67486

$ gcc -O2 -o mapc map.c
$ ./mapc
records loaded in 0.177162
records red in 0.183764

$ head dict
a
A
AA
AAA
Aachen
aardvark
Aaren
Aarhus
Aarika
Aaron

$ cat dict | wc
62074 62074 547509

$ uname -a
Linux compaq 2.6.18-6-amd64 #1 SMP Tue Aug 19 04:30:56 UTC 2008 x86_64 GNU/Linux

$cat map.cpp  


#include <map>
#include <string>
#include <iostream>
#include <fstream>
#include <sys/time.h>

using namespace std;

int main() {
    map<string, string> hash; 
    string line;
    struct timeval start, middle, end;
    time_t s,ms;

    gettimeofday(&start, NULL);

    ifstream in("dict"); 
    while (in >> line) {
        hash[line]=line;
    }
    gettimeofday(&middle, NULL);
    s = middle.tv_sec - start.tv_sec;
    ms = middle.tv_usec - start.tv_usec;
    if(ms<0){ s--; ms+=1000000; }
    
    cout << "records loaded in " << s << "." << ms << "\n";

    for(int i=0; i<10000; i++) {
        hash["abracadabra"];
	hash["vasya"];
	hash["test"];
    }
    gettimeofday(&end, NULL);
    s = end.tv_sec - middle.tv_sec;
    ms = end.tv_usec - middle.tv_usec;
    if(ms<0){ s--; ms+=1000000; }
    
    cout << "records red in " << s << "." << ms << "\n";

    return 0;
}


$cat map.c

#include <string.h>
#include <stdlib.h>
#include <stdio.h>
#include <errno.h>
#include <sys/time.h>

#define MAX_MAPPING_FILE_LINE 1024
#define MAX_KEY 1024

#define OUT_OF_MEM() { \
			fprintf(stderr, "Out of memory: %s:%d", __FILE__, __LINE__); \
			exit(-1); \
		}


typedef struct hash_bucket{
    char* key;
    void* data;
    struct hash_bucket* next;
} hash_bucket;

struct hash{
    long size;
    long n;
    long used;
    hash_bucket** table;
    char name[256];
};


static unsigned long calc_hash(char* str){

    unsigned long hash = 0;
    int c;

    while ((c = *str++))
        hash = c + (hash << 6) + (hash << 16) - hash;

    return hash;
}


struct hash* hash_create(long size){

	struct hash* table;
    hash_bucket** bucket;
    long i;

    table = (struct hash*)malloc(sizeof(struct hash));
    if(table == NULL){
    	OUT_OF_MEM();
    }

    if (size <= 0){
        free(table);
        return NULL;
    }

    table->size = size;
    table->table = (hash_bucket**)malloc(sizeof(hash_bucket*) * size);
    if(table->table == NULL)
    	OUT_OF_MEM();

    bucket = table->table;

    if (bucket == NULL) {
        free(table);
        return NULL;
    }

    for (i = 0; i < size; i++)
        bucket[i] = NULL;

    table->n = 0;
    table->used = 0;

    return table;
}


void* hash_put(struct hash* table, char* key, void* data){

    unsigned long val = calc_hash(key) % table->size;
    hash_bucket* bucket;

    if ((table->table)[val] == NULL) {
        bucket = (hash_bucket*)malloc(sizeof(hash_bucket));
        if (bucket == NULL)
        	OUT_OF_MEM();

        bucket->key = strdup(key);
        bucket->next = NULL;
        bucket->data = data;

        (table->table)[val] = bucket;
        table->n++;
        table->used++;

        return bucket->data;
    }

    for (bucket = (table->table)[val]; bucket != NULL; bucket = bucket->next)
        if (strcmp(key, bucket->key) == 0) {
            void* old_data = bucket->data;

            bucket->data = data;

            return old_data;
        }

    bucket = (hash_bucket*) malloc(sizeof(hash_bucket));
    if (bucket == NULL)
    	OUT_OF_MEM();

    bucket->key = strdup(key);
    bucket->data = data;
    bucket->next = (table->table)[val];

    (table->table)[val] = bucket;
    table->n++;

    return data;
}

void* hash_get(struct hash* table, char* key){

    unsigned long val = calc_hash(key) % table->size;
    hash_bucket* bucket;

	/* printf("getting attempt %s<\n", key); */

    if ((table->table)[val] == NULL)
        return NULL;

    for (bucket = (table->table)[val]; bucket != NULL; bucket = bucket->next)
        if (strcmp(key, bucket->key) == 0)
            return bucket->data;

    return NULL;
}


static int read_mapping(const char *filepath, struct hash *map){

	FILE *fp = NULL;
	char line[MAX_MAPPING_FILE_LINE];
	char *delim;
	char *valStart;
	long lineNum = 0;
	long len;
	char key_buffer[MAX_KEY + 1];
	char val_buffer[MAX_MAPPING_FILE_LINE + 1];


	if ((fp = fopen(filepath, "r")) == NULL){
		printf("Cannot read %s\n", filepath);
		return -1;
	}

	errno=0;

	while(fgets(line, sizeof(line), fp) != NULL){

		if(errno){
			fprintf(stderr, "Can't read from file %s: %s\n", filepath, strerror(errno));
			exit(-1);
		}

		if(line[sizeof(line)-1] == '\n')
			line[sizeof(line)-1] = '\0';

		len = strlen(line);

		if(len == 0){
			fprintf(stdout, "WARN: line %ld is empty -- ignored" ,lineNum);
			continue;
		}

		strcpy(key_buffer, line);
   	   	strcpy(val_buffer, line);

     		hash_put(map, strdup(key_buffer), strdup(val_buffer));

	}/* fgets */

	errno=0;

	fclose(fp);

	if(errno)
		fprintf(stderr, "Can't close file descriptor %s: %s\n", filepath, strerror(errno));

	return 0;
}

int main(void){
	struct hash* map;
    struct timeval start;
    struct timeval middle;
    struct timeval end;
    time_t s,ms;
	int i;

    	gettimeofday(&start, NULL);

	map=hash_create(100000);
	read_mapping("./dict", map);

    	gettimeofday(&middle, NULL);
    	s = middle.tv_sec - start.tv_sec;
    	ms = middle.tv_usec - start.tv_usec;
    	if(ms<0){ s--; ms+=1000000; }

	printf("records loaded in %d.%d\n", s, ms);

   	for(i=0; i<10000; i++) {
        	hash_get(map, "abracadabra");
		hash_get(map, "vasya");
		hash_get(map, "test");
    	}

	gettimeofday(&end, NULL);
	s = end.tv_sec - start.tv_sec;
    	ms = end.tv_usec - start.tv_usec;
    	if(ms<0){ s--; ms+=1000000; }
	printf("records red in %d.%d\n", s, ms);

	return 0;
}
Berkley algorithm used in custom hash (the most suitable for human language hashing): I've got 0 collisions populating English words (dict) using it and 70 colisions - with Bernstein algorithm, so I'm using the Berkley one)


 static unsigned long calc_hash(char* str){
     unsigned long hash = 0;
     int c;
     while ((c = *str  ))
         hash = c   (hash << 6)   (hash << 16) - hash;
     return hash;
 }

$ javac Map.java

$ java Map
records loaded in 1829ms
records red in 86ms

import java.io.*;
import java.util.*;

public class Map{
	public static void main(String[] args){
	   try{
		long t0 = System.currentTimeMillis();
		HashMap<String, String> hash = new HashMap<String, String>();
		BufferedReader is = new BufferedReader(new FileReader("./dict"));

		String line = is.readLine();
		while(line!=null){
			hash.put(line, line);
			line = is.readLine();
		} 		
		System.out.println("records loaded in " + (System.currentTimeMillis()-t0) + "ms");
		t0 = System.currentTimeMillis();
 		for(int i=0; i<10000; i++) {
        		hash.get("abracadabra");
        		hash.get("vasya");
        		hash.get("test");
    		}
		System.out.println("records red in " + (System.currentTimeMillis()-t0) + "ms");
	   }
           catch(Exception e){
		e.printStackTrace();
	   }
	} 
}

$ g++ -O2 -o hashmap hashmap.cpp
$ ./hashmap
records loaded in 0.521390
records red in 0.160082

$cat hashmap.cpp

#include <map>
#include <string>
#include <iostream>
#include <fstream>
#include <ext/hash_map>
#include <sys/time.h>

//std::string does not have declared the following because
//hash_map is considered to be ext, so declare it here:
namespace __gnu_cxx
{
        template<> struct hash< std::string >
        {
                size_t operator()( const std::string& x ) const
                {
                        return hash< const char* >()( x.c_str() );
                }
        };
}

using namespace std;
using namespace __gnu_cxx;

int main() {
    hash_map<string, string> hash; 
    string line;
    struct timeval start, middle, end;
    time_t s,ms;

    gettimeofday(&start, NULL);

    ifstream in("dict"); 
    while (in >> line) {
        hash[line]=line;
    }
    gettimeofday(&middle, NULL);
    s = middle.tv_sec - start.tv_sec;
    ms = middle.tv_usec - start.tv_usec;
    if(ms<0){ s--; ms+=1000000; }
    
    cout << "records loaded in " << s << "." << ms << "\n";

    for(int i=0; i<10000; i++) {
        hash["abracadabra"];
	hash["vasya"];
	hash["test"];
    }
    gettimeofday(&end, NULL);
    s = end.tv_sec - middle.tv_sec;
    ms = end.tv_usec - middle.tv_usec;
    if(ms<0){ s--; ms+=1000000; }
    
    cout << "records red in " << s << "." << ms << "\n";

    return 0;
}


So, what surprised me is that Hash Map suggested by C Plus Plus gurus as the real hash (vs red-black tree algorithm in the default Map) is even slower.
And both are more than 3 times slower than C implementation (statistics of multiple runs is now shown and times are shown only for demonstration - you can try by yourself on your machine).
Also notice that C implementation is brute-force not optimized one and it is possible to write custom allocater rather than doing malloc every time.


  • Add to Memories

Correlation between periods of world financial crises and significant shifts in computing.
[info]siberean

Or a chance for the next technological move towards more cost-efficient computing, the elimination of closed-loops of vendor-locked inefficient schemes, especially in ‘Big’ IT.


As it is well known – big IT clients (such as banking, government, utility companies, insurance companies, big retailers) are very hard to shift and reform. The reasons behind this inertia include: 1) there are many regulations and rules, so it is hard to push forth  innovations, as opposed to smaller IT organizations; 2) those are ‘Gig’ clients and they frequently employ 3rd-parties for their IT needs (consultants, subcontractor firms, IT 'solution providers'), frequently not having industry-leading experts in-house; 3) the existence of inherited infrastructures closely bound with 'solution providers', means that the companies the ‘Gigs’ are dependent on, are not constantly bombarded by  big competition (due to already-existing implementations and infrastructure). It is thereby likely that the next project will be implemented using existing infrastructure technologies in an iterative manner, rather than adopting something from the outside. This is because, at first sight, the cost of addition (of another feature or a new project) seems to be low (since infrastructure is already in existence), and similarly, the risk seems to be low – at least from the managerial perspective ('works – do not touch!').

 

But there are catches: the overall license costs are astronomic; they are periodic, permanent, and on the increase. The more projects bound with the technology, the harder it will be to cure/reform the system in the future (more projects will be necessary to promote change). It is a unidirectional move toward an infinitely growing budget, where dependency on the vendor (vendor-lock) increases as more and more projects are added. Another big issue is the fact that some 'Gigs' include not-for-profit organizations or the government. So, there is a clash between the interests of for-profit vendors interested only in the growing presence of permanent cash-flow from big organizations like governments, and the interests of governments to re-allocate limited resources to other urgent needs such as health-care or education (discussion of this is beyond the scope of this article).

 

 

However, there is another, more significant closed loop (for closed loops – read about the recursive functions theory or see Hofstadter’s book*): the people’s skills loop, which is much harder to break. Nobody who has been recently considered a highly-paid professional will want to start as a university graduate, putting passed  courses diplomas and certifications aside to start learning something completely different (in which universities are frequently giving more deep education) and to collect experience in those new areas, from scratch.

It is much easier (as it seems to many) to find another place where their old skills are still required, or more likely, to change absolutely nothing but rather push the manager to pay for another license term.

 

So the loop consists of the following:

1) The industry requires some skills to support the infrastructure that is already in existence (because there are always old projects on the basis of old technologies, somebody needs to maintain them);

2) Education centers (it is a free market!) provide the courses which are acquired by “Gigs”, where only 'Gigs' can be the main customers of such courses (universities are cheaper!). Then sub-contractors and consulting firms looking into the demands acquire professionals with the same skills;

3) Professionals push the solutions that they know, that they were taught and have had experience with (not those which are the most cost-efficient and/or which they do not know);

4) Managers (of ‘Gigs’ IT departments) look into _current_ industrial trends (who defines the trends?) and courses-on-offer (particularly, those they are most familiar with, having completed the same certifications). They then send personnel into the same fields (that again, they, the managers, are most familiar with).

5) IT personnel, having completed their certifications at the courses, implement the same closed solutions, thereby continuing the closed loop.   

 

It is a primary feature of closed loops that they cannot be resolved by themselves (Gödel theorem) and hence a solution can come only from the outside.

 

But there comes a crisis (outside forces) which cures the system!

 

Let me explain:

 

In this article I'd like to highlight the interesting correlation between crises and significant innovations in big static IT departments: periods of crisis and innovation alternate.

By crises I encapsulate both recent decades’ oil and commodity price drops and financial crises. It will also be discussed how the IT industry was able to break vendor-locks in the past and find more (cost) efficient schemes of computing, i.e. to cure itself. Solutions come from outside of those closed-loops, and the ill industry (bound with some old vendor-lock solution) accepts those innovations, thereby reforming itself.

 

We will not touch on home computing (Amiga, Atari, Mac, PC and mobiles burst of the recent years) – since in this article the focus is on the ‘Gig’ IT industry. Although workstation costs, especially licensing, may be very significant for big organizations, the assumption is that workstation or client machines are a part of the whole infrastructure and are usually closely bound with the technologies used on the central (server) machines. We are more interested in server machines, OLTP machines, central batch processing units, database servers –  machines serving big numbers of clients (clients may be both external, such as public Internet users, or internal, such as Intranet workstations users). Those central machines are usually the most expensive ones (mainframes or server farms, clusters, multi-way servers, big file and database storages) due to expensive top-hardware as well as high license costs (usually per-processor).

 

Although commercial computers appeared in the 50s – let’s begin in 1975 since before that there was a pure mainframe era, where one vendor dominated the market.

 

Let word financial crises (drops of oil prices) be marked by pipes on the year lineage below. As can be seen, periods between both crises and significant shifts toward less expensive technological innovations, span an average of 7 years (2000 is an exception: y2k bug fix is there which is a very special case – for the industry).

 

 

--75----------80----------85----------90----------95----------2000----------2005---------2010--

    |                       |                                |                       |                 |                       |

 

The single-digit numbers in the next chart represent eras or computing epochs (will be explained later), where the edges between epochs (pipes) can be characterized by significant shifts in technology/thinking towards more competition. Notice again that in this article we are interested only in ‘Big’ computing and ‘Big’ computers while by the shifts we mean innovations, ‘revolutions’ in the minds which brought some kind of decentralization in each case, introducing a less costly architectural solution for the problem.

It is interesting how those ‘significant shifts’ occur exactly in the middle of crises regardless of the cause: whether a cheaper solution is a consequence of the corresponding crises because of industrial demand (demand in a cheaper system), or vice-versa: each crisis simply releases more talented professional resources into the world, and those talents (the number of which is always limited) have more time for free creativity: to make a better system. And then the system (‘Gigs’: banks, government IT etc) just inevitably accepts those ‘solutions’, born outside as the more efficient.

Notice that a totally new system/solution cannot be created 'inside the system', it must be 'above' the system, coming from the outside (which has a deep mathematical background, beyond the scope of this paper).   

 

--75----------80----------85----------90----------95----------2000----------2005---------2010--

1|                2                  |               3                 |        4        |         5          |          6          ?

                      

           

1)      Mainframe era
        *    all systems are proprietary, expensive, centralized

*    one vendor in the market

*    only the wealthiest organizations can afford such systems because of the cost: millions of dollars

2)      Minicomputers era (3rd generation computers like DEC PDP-11/50):

*    first TSS (time-sharing systems) appear

*    no payment for central computer time

This is a paradigm shift from #1, and there is no longer a dependency on one vendor: it is a move to decentralization.

3)      Era of commercial Unixes:

*    much more 'standard' and simple system than #2

This is a revolutionary shift because compared with the old custom systems, where no standards existed, - more vendors can implement well known, and which is more significant,– simpler OS design standard.

*    there is competition between vendors (HP Unix, IBM AIX, Sun Solaris), and moving to less expenses boxes

*    standardization of DBMS: SQL standards. Oracle, supporting the standard begins dominating (as opposed to hierarchical or other databases)

*    networks boost encourages more decentralization. Databases can now be installed on different Unix systems and competition is rampant.

4)      Windows (personal home computer) arrives into enterprise as an even cheaper solution than #3 (less than tens of thousands of dollars per server rather than hundreds of thousands of dollars per Unix servers):

*    PC hardware is seen as ‘almost free’ hardware. Although there are catches (which are ignored as solvable by themselves within a few years): in particular, robustness and scalability. The main point: the software in this period is still far from being free.

*    But the programming becomes cheaper and cheaper (C++, VB, Delphi, java, html/JavaScript) rather than C, assembly in era 3. More programmers can now do it

*    More recent invention of the Web (actually it was earlier, in 91, in 3, but only now it becomes practically useful due to faster cheap networks), – everything is moved into the web: both thick and thin clients (the latest are reincarnations of ‘dumb’ terminals from 1, 2, but now – running on $1k commodity hardware). Windows dominates the client and also NT comes into the server.

*    Cheaper MS SQL server competes with Oracle.

 

5)      Java becomes the main application logic language, pushing out C++ and Y2K contributes to java dominance as the language of business logic (Cobol substitute in many projects).

*     Appearance of the first good free environments (for java for example).

*    Open-source arrival into enterprise.

*    Google proves that even clusters hardware may be commodity cheap hardware running free software.

6)      After Dot-Com crash.

*    Linux becomes scalable enough for the enterprise (after the appearance of the 2.6 Linux kernel) and starts to substitute commercial Unixes inherited from 3. Now the whole  OS on the server can be free.

*    Commodity processors (Amd64) becomes 64-bit. Now free 64-bit OS can be installed on very cheap 64-bit processors. For example - clusters (and most of the Top500 computers) are built exclusively using commodity processors (such as AMD64).

*    Free databases become scalable for enterprise (MySql, PosgreSql)

*    Open-Source arrival to the enterprise

*    Web2 (O’Reilly terminology): building of rich applications on the web. Competition in browsers area, standardization of JavaScript, DOM and CSS which avoids necessity in non-web rich clients

*    Outsourcing of the software development and hosting (although this is not always appropriate)

 

So, this epoch can be mainly characterized as the arrival of free high-quality and scalability for enterprise software (most of epoch 5 software was non-free). This process is still continuing and more and more big companies and governments are moving towards open-source software running on free OS, development under free IDEs, the use of free frameworks and complete free sites (example Drupal and hundreds of PHP, java frameworks).

 

 

Now, in 2008, we have reached the next crisis. How we can make IT more cost-efficient now? What remains expensive in IT infrastructures today? Where do vendor-locks continue to exist?

 

 

RDBMS used in big installations are still mostly vendor-locked and non-free. Free databases (such as MySql and PostgreSql) are successfully used by small companies for very large installations, scaling well for a very large number of clients, but still rarely used in government environments (although more and more installations have already proven their efficiency). As it is seen by many people – there will be a big shift in this area (I put a question mark in the chart – estimating the point when the period will end). I believe (sure, if the progress will go on and no reason to think otherwise) that probably all the databases will already be shifted to free ones by 2012 and most DBAs will be familiar with them - in the same way that all epoch-3 (thick client) programmers became familiar with the web. So, licensing costs on the server-side is still a big issue for big organizations.
 

Application servers and development environments are frequently non-free (due to the existence of EJB architectures) but this is quickly changing, moving to the lighter ORM solutions (such as hibernate) or direct JDBC database connections (interfacing through SQL in a PHP manner) where there are no over-engineered additional client-server layers with all the added and unnecessary caching/scalability complexities. 

The simplification of server infrastructures (removal of complex custom extra tier and scaling by means of database clustering and fail-over) and the making of the Open Source database as the standard can be compared with the arrival of Unixes during the Minicomputers 2 era (as the new simpler standard).

What else is still vendor-locked? The client (mostly XP), inherited from era 4 (license costs multiplied by number of machines), while Ubuntu or Debian (or any other) Linux distribution with OpenOffice and thousands of free tools can cover 90% of user tasks on the client desktop. Sometimes even a light-client remote office will work (similar to what Google is trying to promote), but either as a client through the old familiar X or a web-client to a Google-docs-like but – installed on the Intranet server – for familiar secure document editing. Old robust NFS, maybe together with SMB, all familiar Unix tools, but now – free.

Making big infrastructures vendor-lock free, regarding the software – is the only way – to break another closed loop and to make IT departments eat less money from organizations during hard times.

And we have shown that the whole computing history is the permanent move to less and less vendor-locked systems to cheaper and lighter (simpler) solutions. There is no reason for this process to stop. Crises are just catalysts of this process and bring more clever and simpler solutions from outside of the systems. Even such restricted-to-change organizations as government IT's have to accept those changes for their own survival and benefit. The above-mentioned possible moves (databases, desktop client) toward free solutions are only guesses, but analogies with past era movements make me think that those assumptions are right - also, this is the only right way to go (the obvious, progressive way): to the less vendor-locked world.

 

Nov 23, 2008

Toronto

 

 

*) Douglas R. HofstadterGodel, Escher, Bach: An Eternal Golden Braid”, 1979, 1999.

 

a1db4c3cab5da30e5cd0e18b6abd6c2b
  • Add to Memories

Scalable and easy maintainable archiving system for lifetime archives made from old hardware
[info]siberean

 



Old machines, even such as Pentium ММХ, Pentium 2 etc (not speaking about first Pentiums 4, long time ago been moved by some folks to the range of outdated) can work even today - for some very real tasks.
Unfortunately, I can't employ my first 8088-2, 386DX-66, 486-100 because they were lost during movings (and also during the studentship - you offen sell your old PC - to get money for the next, more powerful machine ;).
Also no sense to search for the old 1.44M diskettes with Slackware2, kernel 1.0 (how it would be required with alternative OSes) because even modern Linux distributions will perfectly work on such hardware. There are specilized light distributions (with light X window managers), but even full-featured distributions will work (although on Pentium 1-2  class machines - without Gnome/KDE, even without X. And actually X is not necessary for the tasks described below).
For example, on the screenshot - Gentoo 1.4 (2004) is shown, on another machine - Gentoo 2006 or close, and Debian - on a more powerful machine, from which the screenshot was made, and where X Windows (Gnome) is installed.

How such machines can be used efficiently?
Everybody might think about use as a router, firewall (with custom-made firewall rules for  additional filtering and logging, in cases where small consumer appliances such as DLink, Linksys etc are not giving the desired flexibility/functionality), as a shared file-server, DNS server etc.
However in this article I'll describe the archiving solution which I use many years, where more than one of such machines are working in a concert as almost zero-cost backup servers.

Another thing worth to notice here is that despite it's age - such old hardware only seems to be non-robust. It depends. For example, similar to the shown on the screenshot Pentium ММХ had been working another 5 years (after it's desktop life) as the firewall and DNS server, and even Apache was there with few modules, including jserv (with only 64M RAM on that machine), been 24/7 on. And is still alive (3 years as a desktop, 5 years as a server +).
And my Pentium3-450 is already powerful enough - to watch video, and I still use it sometimes - for Internet browsing, or even work. At the same time - today's hardware components may die after just a few years of work. So, I'm not agree with those who will say that maintaining of such machines is permanent headache - just because they worked out their resource/life. Even if we'll take the most fragile part - moving disk: I had 200G disk died while 8.5G disk is still working as a boot disk on one of old machines. And you should agree that it is good - when the hardware is working untill the very end of it's resource lifetime and sometimes (I would say - very common) such lifetime is amazingly long.

So, back to the main purpose of the article - about usage as archive servers.
Archive-servers is a special case where you need redundency of the data and even more - you need distributed redundency (so simple NAS or even RAID will not be sufficient and will not solve the main task - to be sure that a local fire or a theft of the computer will not destroy all your home archives, photos etc, the resources which cannot be reproduced or downloaded from the Internet again). At the same time such archives mirrors must be always updatable because we change ourselves and grow every day: we re-evaluate our values, reorganize archives, adding to it, make new updated copies (example - adding smaller versions of photos, reorganizing) etc. In extreme cases - we want to rename the root directories (update the ontology hierachy ;) in the main, master file-server. In everyday cases - adding of newly arrived emails, articles, copied-pasted chunks of information, updating resumes and documents, adding of other information pieces. Those changes should be incrementally propagated to the mirrors (secondary archives/backups), not taking a lot of time from our valuable lifetime. Actually we do not want to convert ourselves into sysadmins and to spend a lot of time on system administration of the backup systems during evening hours and weekends ;) so we need an automatic process.
Sometimes I also make a whole system backups (of kids' or wife's computer or a boot disk of the archive computer or another linux machine), so when it is necessary or when an old 8-10G hard-disk on an archive server will show the first signs of aging - I''ll be able to quickly replace the disk and copy a compatible system from the corresponding tarball, modifying only a few configuration files rather than wasting time on re-installation.

About terminology here: main archive server is what all workstations see (through samba or NFS) and where we make daily or weekly backups to. If any document or photo is necessary and it is already moved from the camera - the archived server is booted. After a significant portion of the work is done - the archive server is booted, almost every day (or at least weekly) - for a few minutes at least, and it does not work permanently. 
So, archive computer is the one which must be visible from all computers (workstations) on the network and is a level 1 backup or master backup, where all archive modifications are made. It is beneficial to run both NFS and SMB on it - to make it visible to all workstations including Windows.

Once in a week (or frequently - depending on how much information you do not want to loose - if a disk on the archive server will fail) - changes on the archive server is been synchronized (pushed) to the secondary archive server. That one is an incremental update and is fast, so such task does not take a lot of time. rsync is the perfect utility for one-way synchronization and actually we do not need more as will be shown shortly.
One of things one should be careful with - is rsync's "--delete" option, which is necessary - when you want your main working archive copy (master) to be propagated to the slave mirror. If a source disk will suddently crash or root catalog is removed accidently - you do not want to propogate such changes to the secondary archives). I have "--delete" always commented out, automatically propogating only additions. And rarely (or when I want to delete something expilitly) - I invoke rsync with "--delete" - watching - what is exactly been deleted. So, do not use "--delete" in automatic scripts and use it manually, under supervision.

One would use 3rd level of archive (3rd copy), propagating changes on secondary (or on first) archive server - to another copy, say, yearly. And to put that yearly archives snapshot - into a safe place, into a remote location (in case of fire, disasters etc), into a bank safe etc. This will guarantee that you will have a copy of your generation archives (or soon - more than one generation) somewhere else, not on one localtion.
Hopefully, it is clearly seen than RAID solutions are not solving the problem, they protect against local disk failure, but not - the data redundency. And looking from another side - if we already have distributed data redundency - why we need expensive RAID solutions at all? :)

Now - a bit about what and how we copy to the master archive. It depends on the preferences and significance of the daily/weekly data, in which way it comes daily (version control working copy, files, mails, data chunks, only significant emails been copied into text files or large hierarchies).
For me, for example, it is very convenient not to think each time - where I need to save (because this data will accessed rarely), but - to do it fast enough, so I'm saving files in the original form, AS IS, using directory as a unit of saving - when number of files is bigger tham\n one. If it is a chunk of data (copied/pasted piece of code, for example) - it is saved in a file.
The name of the directory in the first case and of the file in the second - is not very significant because you will have a huge number of files in the archive anyway and the name will not help much - you need an index or a database over your archives anyway. And the proper maintanance of indexes is a big topic by itself (I'm maintaining index in multiple ways including text index and scripting) and is out of scope of this article.
So, directory names and filenames may be just today's date (in a way 2008-11-21 or similar) or some another way. Some folks even use MD5 or another hash in the same way as some version controls and content management systems are doing - as additional data integrity check but I do not see a big benefit comparing with overcompication of the whole process (at the end - we use the same filesystem at the backend and filenames is not an issue - they are original filenames and storing the original filenames directly is a big benefit by itself, while control sums can be used independantly, not hiding the original filenames).
Another think worth to notice in this paragraph is that the hard-disk space is cheap today and by any means - it is not worth to spend your valuable lifetime minutes on archives maintaining
and the most frequent operation (daily copying) must be as faster (and as simpler) as possible: just copying (one command), AS IS, instantaneously.
The second most frequent operation - is weekly backup of the archive server and it also should be automated: switch on 2 machines, run script (or run script automatically on boot), wait for a few minutes (so, the backup should be incremental) and shut down the machines. If delete changes should be pushed to the secondary archives - nothing to do - you should supervise the process (as mentioned before) and spend those minutes watching - whether all disks are OK and the deleted staff is the correct one. Yearly process is not very different from the weekly one (in the process) with the exception that the amount of information is bigger, there were likely big reorganizations of the data been performed and so - the synchronization time to wait will be just bigger. 

As I've found, old Pentium - class machines are perfect as 'slave' machines for such tasks. Why 'slaves'? After hierarchy became big - I noticed that 64M RAM became too small for rsync (after the hierarchies on the disks became big) and since rsync is doing comparison of the trees in-memory- I started to use a 'master' machine - for invoking scripts and to mount other (Pentium1-class) machines through NFS. It seems - one could theoretically have unlimited number of such machines and a simple shell-script update - is the only modification - to scale up to unlimited volumes of data. So, the shown Debian (more powerful machine with X) is the 'master' machine and abovementioned archive servers (1st, 2nd, 3rd levels) are 'slaves' - in my terminilogy - like in computing clusters. Another similarity with clusters is that master gives the tasks to slaves communicating through NFS, distributing the tasks and that the number of slaves is not limited (so as the volume of the archives storage).
Another benefit of the 'master' is that it is easy to maintain scripts in one place (and sure - do not forget to backup the 'master' system into the archive too.

Why NFS is used? I tested samba and smbfs at some point and found that NFS was better utilizing the network bandwith than smb, so the more efficient, faster protocol was chosen for the backup cycle - to wait less minites weekly (sure, no sense to run the process through ssh on such processors as Pentium 1).
The only expense for the archiving solution (we are not counting disks which are necessary anyway and whcih were purchased in different years - only when required by growing data) - was IDE controllers bought on ebay for less than $30. I compiled kernel with support of 3 main controller types, I use once and reuse those kernels/systems. At some point I've got a few problems with Sil controllers while never - with Adaptec, so I'm preferring the latest one, or the  Promise ones, but this a personal preference.

I thought that using today's 1.5T disks, one could place even 6T of storage per garbage-class (for most of alternative system users) Pentium1 machine and the only difference (with the modern machines) you will notice - is the speed of copying from machine to machine, which  will not be actually magnitude of order or even multi-folded (if updates are incremental) - comparing with CPU power differences. This probably depends on the data and on multiple factors. I do not need such enterprise storages, so I use old disks with different volumes - to utilize the old disks - untill their own natual deaths. The policy is the following.
Disks are named as 1,2,3... and if a new machine is added - the count is continues. It is been inherited from the initial partitioning of the archives into physical disks (back in middle 90s). Then - when after years a disk is naturally dies or no sense to maintain, say, 100M disk, - those numbers are still preserved and moved to the new disks now as a directory (all archive disks are having one physical partition, except the ones used also for booting). If a bigger disk is coming on place of multiple smaller ones - it gets corresponding directories: 1,2,3... - depending on how many old disks are copied to it etc.
Now let's imagine that one disk dies on the main (source) archive server (on the file-server which you use for weekly backup and which you use for reorganizing of the data). You take the disk from the second mirror and put it into the master. Then - you buy a bigger disk (because you probably want to buy the one with the cheapest cost per Gigabyte for today), and depending on it's volume - create directories - for mirroring other (smaller) disks on the master. You 'rsync' from the 'master'  now and since the disk is bigger - you can free some other small disks from the slave (for example, recently I freed 200G disk which is waiting - when any of the next smallest ones 10-year old 8G or 10G disk will die). Etc. Having such substitution policy - all disks are working the whole lifetime while the copying (of both the data and the system - when the boot disk dies) - does not take a lot of time, so maintanence is minimal possible. And no data is lost. 'rsync' is doing the main job (tar - when a boot disk has to be recovered).

The above-mentioned archiving process allows to utilize the old hardware completely (which seems to be the most cost-efficient way), without any additional spending (including time). 

a1db4c3cab5da30e5cd0e18b6abd6c2b
Scripts used:
#!/bin/sh
#
# Backup script
#
# Assumptions:
# * drives on both machines are mounted as /1, /2, /3, /4
# * rsync installed on both machines
# * nfs should be running on remote host and /1..4 are exported
# * user have write permissions on local disks, we rsync to
#

FROM_HOST=192.168.0.12
TO_HOST=192.168.0.13
LOGFILE=./backup.log
TMP_MNT=/mnt


#empty file from previous session
#>$LOGFILE
for drive in 2 3 
do
	echo "Mounting ${drive}..."
	mkdir ${TMP_MNT}/${drive} 2>/dev/null
	mount -o "rsize=8192,wsize=8192,rw" ${FROM_HOST}:/ISROOT/${drive} ${TMP_MNT}/${drive}
	mount -o "rsize=8192,wsize=8192,rw" ${TO_HOST}:/ISROOT/${drive} /ISROOT/${drive}
	echo "Processing ${drive}..."
	rsync -av --delete --progress ${TMP_MNT}/${drive}/ /ISROOT/${drive}/
#	umount ${TMP_MNT}/${drive}
#	rmdir ${TMP_MNT}/${drive}
done


 
#!/bin/sh
TO_HOST=192.168.0.4

mount -o "rsize=8192,wsize=8192,rw" $TO_HOST:/disk2 /ISROOT/2 
echo "processing disk1 ..."
rsync -av --delete --progress /disk1/ /ISROOT/2/1
 
echo "processing disk2 ..."
rsync -av --delete --progress /disk2/ /ISROOT/2/2

echo "done"
  • Add to Memories

You are viewing [info]siberean's journal