/**********************************************************************
    Copyright (C) 2004 Database Systems Lab, Supercomputer Education and
    Research Centre, Indian Institute of Science, Bangalore, INDIA.
    http://dsl.serc.iisc.ernet.in

    This program is free software; you can redistribute it and/or modify
    it under the terms of the GNU General Public License as published by
    the Free Software Foundation; either version 2 of the License, or
    (at your option) any later version.

    This program is distributed in the hope that it will be useful,
    but WITHOUT ANY WARRANTY; without even the implied warranty of
    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
    GNU General Public License for more details.

    You should have received a copy of the GNU General Public License
    along with this program; if not, write to the Free Software
    Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
***********************************************************************/

Short Description
-----------------	
	This package implements the 'EMASK' algorithm on top of the
	'Apriori' algorithm for boolean association rule mining. 'EMASK'
	is an efficient variant of the original 'MASK' algorithm for
	privacy preserving boolean association rule mining.

	In EMASK, the distortion process is generalized to perform
	{symbol-specific} distortion -- that is, different distortion
	parameters are used for 1's and 0's in a transaction (1 indicates
	the presence of an item and 0 indicates its absence). Estimation
	procedures are designed to carefully choose the distortion
	parameters beforehand, and a variety of optimizations are applied
	in the mining process to achieve the desired goals.
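
	The symbol-specific distortion described above can be sketched as
	follows. This is a minimal illustration, not the code in
	distort-boolean.C: the function name distort_row is hypothetical, and
	it assumes p is the probability that a 1 is left unflipped and q the
	probability that a 0 is left unflipped, matching the argument order
	of ./distort-boolean. It is written in C++11, which is newer than the
	g++ 2.95.3 compiler the package was tested with.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical helper for illustration; not taken from distort-boolean.C.
// p = probability that a 1 is kept as 1, q = probability that a 0 is kept as 0.
std::vector<int> distort_row(const std::vector<int>& row, double p, double q,
                             std::mt19937& rng) {
    assert(p >= 0.0 && p <= 1.0 && q >= 0.0 && q <= 1.0);
    std::bernoulli_distribution keep_one(p);
    std::bernoulli_distribution keep_zero(q);
    std::vector<int> out(row.size());
    for (std::size_t i = 0; i < row.size(); ++i) {
        assert(row[i] == 0 || row[i] == 1);
        const bool keep = (row[i] == 1) ? keep_one(rng) : keep_zero(rng);
        out[i] = keep ? row[i] : 1 - row[i];  // flip the bit when it is not kept
    }
    return out;
}
```

	With p = q = 1 the row is returned unchanged, and with p = q = 0
	every bit is flipped; useful settings lie in between.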
	
	To get the results of the MASK algorithm from this package, run the
	program with the distortion probabilities p and q both set to 0.9
	(p=q=0.9).
	
	References:
	===========
1.	To understand the implementation of 'Apriori' algorithm, read 
	the original paper that gave the algorithm:
 
	Fast Algorithms for Mining Association Rules,
    		By R. Agrawal and R. Srikant,
		In Proc. of 20th VLDB Conf., September 1994

2.	In a later paper, Agrawal mentions that the 2nd pass of the algorithm
	should be accomplished using a 2D array rather than a hash-tree data
	structure.  This implementation takes this into account.  The later
	paper is --
 
	Parallel Mining of Association Rules:
	Design, Implementation and Experience,
		By R. Agrawal and J. Shafer,
	    	as Tech-report, No. RJ10004,
		IBM Almaden Research Center, San Jose, CA 95120,
		January 1996

3.	For details on the MASK algorithm, refer to the following publication:
	
	"Maintaining Data Privacy in Association Rule Mining",
		S. Rizvi and J. Haritsa,
		Proc. of 28th Intl. Conf. on Very Large Databases (VLDB),
		August 2002.	
	
4. 	For the EMASK algorithm, refer to the following publication:
	
	"On Addressing Efficiency concerns in Privacy Preserving Data Mining", 
		Shipra Agrawal, Vijay Krishnan and Jayant R. Haritsa,
		9th International Conference on Database Systems For 
		Advanced Applications (DASFAA) 2004, Jeju Island, Korea. 	
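
	The 2D-array second pass mentioned under reference 2 can be sketched
	as follows. This is a generic illustration of the idea, not the code
	in EMASK.C: the class PairCounter is hypothetical, items are assumed
	to be renumbered 0..n-1 with no duplicates inside a transaction, and
	a full n x n array is used for simplicity where a real implementation
	might store only the upper triangle.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sketch: counts how often each pair of items occurs together.
class PairCounter {
public:
    explicit PairCounter(std::size_t n) : n_(n), counts_(n * n, 0) {}

    // Adds one transaction: every unordered pair of its items gets +1.
    void add(const std::vector<std::size_t>& txn) {
        for (std::size_t item : txn) {
            assert(item < n_);
            (void)item;  // silences an unused warning when NDEBUG is defined
        }
        for (std::size_t a = 0; a < txn.size(); ++a)
            for (std::size_t b = a + 1; b < txn.size(); ++b)
                ++counts_[index(txn[a], txn[b])];
    }

    // Returns the number of transactions seen so far that contain both i and j.
    long count(std::size_t i, std::size_t j) const {
        assert(i < n_ && j < n_ && i != j);
        return counts_[index(i, j)];
    }

private:
    // Stores pair {i, j} in the upper triangle of an n x n array.
    std::size_t index(std::size_t i, std::size_t j) const {
        return std::min(i, j) * n_ + std::max(i, j);
    }

    std::size_t n_;
    std::vector<long> counts_;
};
```

	A pair lookup is a single array access, which is why a 2D array is
	attractive for the second pass when the number of frequent items is
	moderate.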

Directory Structure
-------------------
COPYRIGHT.GPL		GNU General Public License

distort-binary.C 	Distorts input data for privacy preserving
			association rule mining. Takes as input a data
			file in IBM Synthetic Database format.

distort-boolean.C 	Distorts input data for privacy preserving
			association rule mining. Assumes a data file in
			boolean matrix format.

EMASK.C 		Implements the EMASK algorithm.

include 		Directory containing the .h files for EMASK.C.


	
Compile Instructions
--------------------	
	g++ -Iinclude EMASK.C -o EMASK
	g++ distort-boolean.C -o distort-boolean
	g++ distort-binary.C -o distort-binary

	The code base has been compiled and tested under g++ (GCC) 2.95.3.
	If you have problems compiling, please install this version from
	http://gcc.gnu.org/gcc-2.95/


Usage
-----
	Usage: ./EMASK <meta file> <min support> <Result File> <p> <q>

	metafile format:

	<Input data file name>\n
	<no. of items>\n
	<no. of rows>\n
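
	A reader for the three-line metafile above might look like the
	following sketch. The struct MetaInfo, the function read_meta and the
	sample values are all hypothetical and are not taken from EMASK.C;
	the sketch only assumes the layout listed above (file name, number of
	items, number of rows, one per line) and uses C++11.

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>

// Hypothetical holder for the three metafile fields; names are illustrative.
struct MetaInfo {
    std::string data_file;  // line 1: input data file name
    long num_items = 0;     // line 2: no. of items
    long num_rows = 0;      // line 3: no. of rows
};

// Reads the three newline-separated fields; returns false on malformed input.
bool read_meta(std::istream& in, MetaInfo& meta) {
    if (!std::getline(in, meta.data_file) || meta.data_file.empty()) return false;
    if (!(in >> meta.num_items >> meta.num_rows)) return false;
    return meta.num_items > 0 && meta.num_rows > 0;
}

// Usage example with made-up values (file name and counts are placeholders).
void meta_example() {
    std::istringstream in("distorted.data\n1000\n100000\n");
    MetaInfo meta;
    const bool ok = read_meta(in, meta);
    assert(ok);
    assert(meta.data_file == "distorted.data");
    assert(meta.num_items == 1000 && meta.num_rows == 100000);
    (void)ok;  // avoids an unused-variable warning when NDEBUG is defined
}
```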
	
	Usage: ./distort-boolean <prob. of leaving a 1 unflipped> <prob. of leaving a 0 unflipped>
		<no. of items> <undistorted data file in boolean matrix form> <distorted file>

Structure of the input-file
---------------------------
	* The input data file is expected to be a distorted binary file in IBM
	  Synthetic Database format, where each tuple has the following format:
	  <transaction no><customer no><no. of items><item1><item2>....<itemn>
	* To read from an input data file in the form of a boolean matrix,
	  replace svector.h with svector.boolean.h
	* To read from a file in your own format, change the database_read
	  function in svector.h
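
	The tuple layout above can be read with a routine along these lines.
	This is only a sketch and not the database_read function from
	svector.h: the names Transaction, read_field and read_transaction are
	hypothetical, and it ASSUMES that every field is a 4-byte integer in
	the machine's native byte order, which this README does not state.
	Check the actual field widths of your data file before relying on it.

```cpp
#include <cassert>
#include <cstdint>
#include <istream>
#include <sstream>
#include <vector>

// Hypothetical record type; the package's own reader is database_read in svector.h.
struct Transaction {
    std::int32_t transaction_no = 0;
    std::int32_t customer_no = 0;
    std::vector<std::int32_t> items;
};

// Reads one field. ASSUMPTION: every field is a 4-byte integer in native byte order.
static bool read_field(std::istream& in, std::int32_t& value) {
    return static_cast<bool>(in.read(reinterpret_cast<char*>(&value), sizeof value));
}

// Reads <transaction no><customer no><no. of items><item1>...<itemn>.
// Returns false at end of input or on a truncated record.
bool read_transaction(std::istream& in, Transaction& t) {
    std::int32_t count = 0;
    if (!read_field(in, t.transaction_no) || !read_field(in, t.customer_no) ||
        !read_field(in, count) || count < 0) {
        return false;
    }
    t.items.clear();
    for (std::int32_t i = 0; i < count; ++i) {
        std::int32_t item = 0;
        if (!read_field(in, item)) return false;
        t.items.push_back(item);
    }
    return true;
}

// Usage example: build one record in memory and read it back.
void transaction_example() {
    const std::int32_t raw[] = {1, 42, 3, 5, 9, 12};  // txn 1, customer 42, 3 items
    std::stringstream buf(std::ios::in | std::ios::out | std::ios::binary);
    buf.write(reinterpret_cast<const char*>(raw), sizeof raw);
    Transaction t;
    const bool ok = read_transaction(buf, t);
    assert(ok && t.transaction_no == 1 && t.customer_no == 42);
    assert(t.items.size() == 3 && t.items[2] == 12);
    (void)ok;  // avoids an unused-variable warning when NDEBUG is defined
}
```

	To read a real file, open it as a std::ifstream with
	std::ios::binary and call read_transaction in a loop until it
	returns false.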



License
-------
	The code base is distributed under the GPL. Please refer to
	COPYRIGHT.GPL in the top-level directory for details about the license.


Reporting problems 
------------------
	Send mail to shipra@dsl.serc.iisc.ernet.in
