
How To Get Unique Values Set From A Repeating Values List

I need to parse a large log file (flat file) which contains two columns of values (column A, column B). Values in both columns repeat. For each unique value in column A, I need to find the set of unique values from column B that appear alongside it.
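For illustration, here is a minimal sketch (not from any of the answers below) of the desired grouping, assuming whitespace-separated columns and case-insensitive column A keys, using Python's collections.defaultdict:

from collections import defaultdict

sample = """xxxA 2
xxxA 1
xxxB 2
XXXC 3
XXXA 3
xxxD 4"""

groups = defaultdict(set)      # column A value -> set of unique column B values
for line in sample.splitlines():
    a, b = line.split()
    groups[a.lower()].add(b)   # lower-case the key; the set drops duplicates

for a in sorted(groups):
    print("%s - %s" % (a, ",".join(sorted(groups[a]))))

With a real (large) log file you would iterate over the open file handle line by line instead of loading the whole file into memory.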

Solution 1:

Perl 'one-liner', indented/expanded out so that everything fits in the window:

$ perl -lane '
      # count each column-B value seen for a given column-A value
      $hash{ $F[0] }{ $F[1] }++;
      END {
          # after the last input line, print each column-A key with its unique column-B values
          for my $columnA ( keys %hash ) {
              print $columnA, " - ", join( ",", keys %{ $hash{$columnA} } );
          }
      }
  '

Explanation will follow if I see a concerted attempt on the part of the original poster.

Solution 2:

I would use a Python dictionary where the dictionary keys are the column A values and the dictionary values are Python's built-in set type holding the column B values.

def parse_the_file():
    lower = str.lower
    split = str.split
    with open('f.txt') as f:
        d = {}
        lines = f.read().split('\n')
        for A, B in [split(l) for l in lines if l]:
            try:
                d[lower(A)].add(B)
            except KeyError:
                d[lower(A)] = set([B])   # set([B]) stores the whole string; set(B) would split it into characters

        for a in d:
            print "%s - %s" % (a, ",".join(list(d[a])))

if __name__ == "__main__":
    parse_the_file()

The advantage of using a dictionary is that you'll have a single dictionary key per column A value. The advantage of using a set is that you'll have a unique set of column B values.

Efficiency notes:

  • Using try/except is more efficient than an if/else check for the first occurrence of a key, since the exception is only raised the first time a key appears (see the micro-benchmark sketch after the example output below).
  • Binding the str functions to local names outside the loop is more efficient than looking them up inside the loop.
  • Depending on the proportion of new A values vs. repeated A values in the file, you may want to assign a = lower(A) once before the try/except statement instead of calling lower(A) twice.
  • I used a function because accessing local variables is more efficient in Python than accessing global variables.
  • Some of these performance tips are from here.

Testing the code above on your input example yields:

xxxd - 4
xxxa - 1,3,2
xxxb - 2
xxxc - 3
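To back up the first efficiency note, here is a small, hypothetical micro-benchmark sketch (not part of the original answer) comparing try/except against an if/else membership check for this access pattern; absolute numbers depend on your machine and on how often keys repeat:

import timeit

setup = "keys = ['xxxa', 'xxxb', 'xxxc', 'xxxd'] * 1000"

stmt_try = """
d = {}
for k in keys:
    try:
        d[k].add(1)
    except KeyError:
        d[k] = set([1])
"""

stmt_if = """
d = {}
for k in keys:
    if k in d:
        d[k].add(1)
    else:
        d[k] = set([1])
"""

# with mostly repeated keys, the try/except version skips the extra membership test
print(timeit.timeit(stmt_try, setup, number=1000))
print(timeit.timeit(stmt_if, setup, number=1000))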

Solution 3:

You can use this simple multimap:

class MultiMap(object):
    def __init__(self):
        self.values = {}   # per-instance dict, so instances do not share state

    def __getitem__(self, index):
        return self.values[index]

    def __setitem__(self, index, value):
        if index not in self.values:
            self.values[index] = []
        self.values[index].append(value)

    def __repr__(self):
        return repr(self.values)

See it in action: http://codepad.org/xOOrlbnf
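As a quick usage sketch (assuming the same sample data used in the other answers, and lower-casing column A so the keys group together):

data = """xxxA 2
xxxA 1
xxxB 2
XXXC 3
XXXA 3
xxxD 4"""

m = MultiMap()
for line in data.splitlines():
    a, b = line.split()
    m[a.lower()] = b   # __setitem__ appends, so repeated keys accumulate values

print(m)               # e.g. {'xxxa': ['2', '1', '3'], 'xxxb': ['2'], 'xxxc': ['3'], 'xxxd': ['4']}

Note that the values are kept in a list, so duplicated column B entries are not removed; for a truly unique set you would store a set instead of a list and use add instead of append.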

Solution 4:

Simple Perl version:

#!/usr/bin/perl
use strict;
use warnings;

my (%v, @row);

foreach (<DATA>) {
        chomp;
        $_ = lc($_);
        @row = split(/\s+/, $_);
        push( @{ $v{$row[0]} }, $row[1]);
} 

foreach (sort keys %v) {
        print "$_ - ", join( ", ", @{ $v{$_} } ), "\n";
}

__DATA__
xxxA 2
xxxA 1
xxxB 2
XXXC 3
XXXA 3
xxxD 4

I did not focus on variable names. From the example I see the values are not case sensitive.

Solution 5:


f = """xxxA 2
xxxA 1
xxxB 2
XXXC 3
XXXA 3
xxxD 4"""


d = {}

for line in f.split("\n"):
    key, val = line.lower().split()
    try:
        d[key].append(val)        
    except KeyError:
        d[key] = [val]


print d
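This collects the column B values into lists, so duplicates are kept and the result is printed as a raw dict. If you want the unique, formatted output the question asks for, a small follow-up sketch (not part of the original answer) could be:

for key in sorted(d):
    print("%s - %s" % (key, ",".join(sorted(set(d[key])))))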

