Hi All,
I have a huge trade file with millions of trades. I need to remove duplicate records. For example, I have the following records:
30/10/2009,tradeId1,..,..
26/10/2009,tradeId1,..,..
30/10/2009,tradeId2,..
In the above case I need to filter out the duplicate records, and I should get the following output:
30/10/2009,tradeId1,..,..
30/10/2009,tradeId2,..
(the trade record with the latest COB date)
COB = close of business day
I need to handle the following three conditions:
1. The trade file may be sorted in ascending order on the first two columns (COB date and trade id).
2. The trade file may be sorted in descending order on the first two columns (COB date and trade id).
3. The trade file may not have any duplicate records.
My code should work in all of the above conditions.
I have written the following code, but it doesn't seem to be working. As I am new to awk, can anybody help me get this working?
#!/usr/bin/gawk -f
BEGIN {
    FS = ","
}
END {
    print prevLine;
}
{
    if (FNR == 1)
    {
        # First record: remember it as the current candidate.
        prevDate = $1;
        prevSourceTradeId = $2;
        prevLine = $0;
    }
    else
    {
        if (prevSourceTradeId == $2)
        {
            # Same trade id as the previous record.
            if (compareDate(prevDate, $1) == 1)
            {
                print prevLine;
                flag = 1;
            }
            else
            {
                prevDate = $1;
                prevLine = $0;
                prevSourceTradeId = $2;
                print prevLine;
                flag = 1;
            }
        }
        else
        {
            # New trade id: print the previous record unless already printed.
            if (!flag) print prevLine;
            prevDate = $1;
            prevSourceTradeId = $2;
            prevLine = $0;
            flag = 0;
        }
    }
}
# Returns 1 if lhsDate (dd/mm/yyyy) is later than rhsDate, otherwise 0.
function compareDate(lhsDate, rhsDate)
{
    lhsSize = split(lhsDate, lhsFields, "/");
    rhsSize = split(rhsDate, rhsFields, "/");
    if (lhsSize != rhsSize)
    {
        print "Invalid prevDate " lhsDate " " rhsDate;
        return 0;
    }
    # Compare year, then month, then day.
    for (i = rhsSize; i > 0; i--)
    {
        if (lhsFields[i] > rhsFields[i]) return 1;
        if (lhsFields[i] < rhsFields[i]) return 0;
    }
    return 0;
}
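For comparison, here is a rough sketch of an approach that does not depend on the sort order at all. It is only one possible way to do it, not a tested production script. It assumes the fields are comma-separated, the COB date in column 1 is always dd/mm/yyyy, and the trade id is in column 2. The function name latest_per_trade and the array names are made up for illustration. It keeps one line per distinct trade id in memory, so it may not suit a file where nearly every trade id is unique and memory is tight.

```shell
#!/bin/sh
# Sketch: keep only the record with the latest COB date for each trade id.
# Assumes comma-separated input with $1 = COB date (dd/mm/yyyy), $2 = trade id.
# latest_per_trade is a hypothetical helper name, not an existing tool.
latest_per_trade() {
    awk -F',' '
    {
        # Turn dd/mm/yyyy into the number yyyymmdd so dates compare numerically.
        split($1, d, "/")
        key = d[3] * 10000 + d[2] * 100 + d[1]

        if (!($2 in best)) {
            # First time this trade id is seen: remember its input position.
            order[++n] = $2
            best[$2] = key
            line[$2] = $0
        } else if (key > best[$2]) {
            # A later COB date for a known trade id replaces the stored line.
            best[$2] = key
            line[$2] = $0
        }
    }
    END {
        # Print one line per trade id, in order of first appearance.
        for (i = 1; i <= n; i++) print line[order[i]]
    }' "$@"
}
```

It could be called as latest_per_trade trades.csv > deduped.csv, where both file names are placeholders. Because each trade id is looked up in an array, the same code handles ascending input, descending input, and input with no duplicates.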