Posted on 2009-04-10 04:24:43-07 by nmumbarkar
Need to filter duplicate records
Hi All,

I have a huge trade file with millions of trades, and I need to remove duplicate records. For example, given the following records:

30/10/2009,tradeId1,..,..
26/10/2009,tradeId1,..,..
30/10/2009,tradeId2,..

I need to filter out the duplicates and get the following output (the trade record with the latest COB date, where COB is the close-of-business day):

30/10/2009,tradeId1,..,..
30/10/2009,tradeId2,..

I need to handle the following three conditions:

1. The trade file is sorted in ascending order on the first two columns (COB date and trade id).
2. The trade file is sorted in descending order on the first two columns (COB date and trade id).
3. The trade file may not have any duplicate records.

My code should work in all of the above conditions. I have written the following code, but it doesn't seem to work. As I am new to awk, can anybody help me get this working?
#!/usr/bin/gawk -f
# For each run of consecutive records with the same trade id, keep the
# record with the latest COB date (dd/mm/yyyy in column 1).
# Duplicates must be adjacent in the input for this to catch them.
BEGIN { FS = "," }

{
    if (NR > 1 && $2 == prevSourceTradeId) {
        # Same trade as the previous record: keep whichever is newer.
        if (compareDate($1, prevDate) == 1) {
            prevDate = $1
            prevLine = $0
        }
    } else {
        # New trade id: flush the record kept for the previous trade.
        if (NR > 1)
            print prevLine
        prevDate = $1
        prevSourceTradeId = $2
        prevLine = $0
    }
}

END { if (NR > 0) print prevLine }

# Returns 1 if lhsDate is later than rhsDate, otherwise 0.
function compareDate(lhsDate, rhsDate,    lhsFields, rhsFields, lhsSize, rhsSize, i) {
    lhsSize = split(lhsDate, lhsFields, "/")
    rhsSize = split(rhsDate, rhsFields, "/")
    if (lhsSize != rhsSize) {
        print "Invalid date " lhsDate " " rhsDate > "/dev/stderr"
        return 0
    }
    # Compare year, then month, then day; stop at the first difference.
    for (i = rhsSize; i > 0; i--) {
        if (lhsFields[i] + 0 > rhsFields[i] + 0) return 1
        if (lhsFields[i] + 0 < rhsFields[i] + 0) return 0
    }
    return 0
}
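Because the file is sorted on COB date first and trade id second, two records for the same trade id are not guaranteed to sit next to each other, so comparing only with the previous line can miss duplicates. One possible alternative, sketched below, keeps the best record per trade id in an awk associative array and does not depend on the sort order at all. The function name dedup_trades is a hypothetical helper, and the sketch assumes comma-separated records with a dd/mm/yyyy date in column 1 and the trade id in column 2. It holds one line per distinct trade id in memory, which may or may not be acceptable for a file with millions of trades.

```shell
#!/bin/sh
# dedup_trades FILE: print one record per trade id, keeping the latest COB date.
# Assumption: column 1 is dd/mm/yyyy, column 2 is the trade id, separator is ",".
# dedup_trades is a hypothetical helper name used only for this sketch.
dedup_trades() {
    awk -F',' '
    {
        # Turn dd/mm/yyyy into the number yyyymmdd so dates compare numerically.
        split($1, d, "/")
        key = d[3] * 10000 + d[2] * 100 + d[1]

        if (!($2 in best)) {
            order[++n] = $2      # remember first-seen order of trade ids
            best[$2] = key
            line[$2] = $0
        } else if (key > best[$2]) {
            best[$2] = key       # newer COB date for a known trade id
            line[$2] = $0
        }
    }
    END {
        # Emit the surviving records in the order each trade id first appeared.
        for (i = 1; i <= n; i++)
            print line[order[i]]
    }
    ' "$1"
}
```

It could be called as dedup_trades trades.csv > trades.dedup.csv, where both file names are placeholders. Since nothing is printed until the END block, ascending input, descending input, and input without duplicates all go through the same code path.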