I have a data set that combines the results of several different approaches (attempts 1 to 3) to identifying positions in a genome:

source  chromosome1 bp1 chromosome2 bp2
attempt1    2L  5890205 2L  5890720
attempt2    2L  5890205 2L  5890721
attempt1    2L  22220720    2L  22255744
attempt1    3L  15568694    3L  15568866
attempt3    3R  14006279    3R  14008254
attempt1    3R  14006281    3R  14008253
attempt2    3R  14006282    3R  14008254
attempt3    3R  14006286    3R  14008254
attempt1    3R  32060908    3R  32061196
attempt1    3R  32066206    3R  32068392
attempt3    3R  32066206    3R  32068392
attempt2    3R  32066207    3R  32068393
attempt2    X   4574312 X   4576608
attempt1    X   4574313 X   4576607
attempt3    X   4574313 X   4576608

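I want to find and group the positions that each attempt has identified, allowing some room for error. For example, I want to classify the first two rows...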
source  chromosome1 bp1 chromosome2 bp2
attempt1    2L  5890205 2L  5890720
attempt2    2L  5890205 2L  5890721

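...as a single event (event 1) that has been identified by two different attempts (attempt1 and attempt2). I only want to classify such instances as a single event when different attempts:
  • agree on the position of bp1 +/- 5 (i.e. within the window 5890200..5890210)
  • identify the same chromosome1 and chromosome2 (2L)
  • agree on the position of bp2 +/- 5 (i.e. within the window 5890715..5890725)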
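
The three criteria above can be read as a pairwise test on two rows. The following is only a minimal sketch of that reading: the function name `is_same_event`, the hashref record layout, and the interpretation that both rows must come from different sources are assumptions made for illustration and do not appear in the original post.

```perl
use strict;
use warnings;

# Hypothetical helper: true when two records meet all of the criteria.
# Records are assumed to be hashrefs keyed by the column names.
sub is_same_event {
    my ($x, $y, $tolerance) = @_;

    return 0 if $x->{source}      eq $y->{source};        # assumed: must be different attempts
    return 0 if $x->{chromosome1} ne $y->{chromosome1};
    return 0 if $x->{chromosome2} ne $y->{chromosome2};
    return 0 if abs($x->{bp1} - $y->{bp1}) > $tolerance;  # bp1 +/- tolerance
    return 0 if abs($x->{bp2} - $y->{bp2}) > $tolerance;  # bp2 +/- tolerance
    return 1;
}

# First three rows of the sample data
my %row1 = (source => 'attempt1', chromosome1 => '2L', bp1 => 5890205,
            chromosome2 => '2L', bp2 => 5890720);
my %row2 = (source => 'attempt2', chromosome1 => '2L', bp1 => 5890205,
            chromosome2 => '2L', bp2 => 5890721);
my %row3 = (source => 'attempt1', chromosome1 => '2L', bp1 => 22220720,
            chromosome2 => '2L', bp2 => 22255744);

print is_same_event(\%row1, \%row2, 5) ? "same event\n" : "different events\n";
print is_same_event(\%row1, \%row3, 5) ? "same event\n" : "different events\n";
```

With a tolerance of 5, the first pair is reported as the same event and the second pair as different events.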

I tried to achieve this by using each chromosome and bp as a separate key in a hash:
    my %SVs;
    my $header;
    
    # Make hash
    while(<$in>){
      chomp;
      if ($. == 1){
          $header = $_;
          next;
      }
      my ($source, $chromosome1, $bp1, $chromosome2, $bp2) = split;
    
      # Index each record by chromosome1 -> bp1 -> chromosome2 -> bp2
      push @{$SVs{$chromosome1}{$bp1}{$chromosome2}{$bp2}}, $_;
    }
    

...and then used a sliding-window approach around the bp1 and bp2 values of each row:
    my %events;
    for my $chr1 ( sort keys %SVs ){
      for my $bp1 ( sort { $a <=> $b } keys %{ $SVs{$chr1} } ){
        my $w1_start = ( $bp1 - 5 );
        my $w1_end = ( $bp1 + 5 );
        my $window1 = "$w1_start-$w1_end";
    
        for my $chr2 ( sort keys %{ $SVs{$chr1}{$bp1} } ){
          for my $bp2 ( sort { $a <=> $b } keys %{ $SVs{$chr1}{$bp1}{$chr2} } ){
    
            my $w2_start = ( $bp2 - 5 );
            my $w2_end = ( $bp2 + 5 );
            my $window2 = "$w2_start-$w2_end";
    
            for ( $w1_start .. $w1_end ){
              if ($bp1 == $_){
                push @{$events{$chr1}{$window1}}, @{$SVs{$chr1}{$bp1}{$chr2}{$bp2}};
              }
            }
    
            for ( $w2_start .. $w2_end ){
              if ($bp2 == $_){
                push @{$events{$chr2}{$window2}}, @{$SVs{$chr1}{$bp1}{$chr2}{$bp2}};
              }
            }
    
          }
        }
      }
    }
    
    use Data::Dumper;  # provides Dumper
    print Dumper \%events;
    

This achieves part of what I want, but I can't figure out how to get my desired output:
    event   source  chromosome1 bp1 chromosome2 bp2
    1   attempt1    2L  5890205 2L  5890720
    1   attempt2    2L  5890205 2L  5890721
    2   attempt1    2L  22220720    2L  22255744
    3   attempt1    3L  15568694    3L  15568866
    4   attempt3    3R  14006279    3R  14008254
    4   attempt1    3R  14006281    3R  14008253
    4   attempt2    3R  14006282    3R  14008254
    4   attempt3    3R  14006286    3R  14008254
    5   attempt1    3R  32060908    3R  32061196
    6   attempt1    3R  32066206    3R  32068392
    6   attempt3    3R  32066206    3R  32068392
    6   attempt2    3R  32066207    3R  32068393
    7   attempt2    X   4574312 X   4576608
    7   attempt1    X   4574313 X   4576607
    7   attempt3    X   4574313 X   4576608
    

Best answer

The following defines each equivalence class by the last entry added to it (based on my understanding of your comments above):

    #!/usr/bin/env perl
    
    use strict;
    use warnings;
    
    run(\*DATA);
    
    sub run {
        my $fh = shift;
        my @header = split ' ', scalar <$fh>;
    
        my @events = ([ get_next_event($fh, \@header)]);
    
        while (my $event = get_next_event($fh, \@header)) {
            # change the -1 in the second subscript to 0
            # if you want to always compare to the first
            # event added to the equivalence class
            if (same_event($events[-1][-1], $event, 5)) {
                push @{ $events[-1] }, $event;
                next;
            }
    
            push @events, [ $event ];
        }
    
        print join("\t", event => @header), "\n";
        for my $i (1 .. @events) {
            for my $ev (@{ $events[$i - 1] }) {
                print join("\t", $i, @{$ev}{@header}), "\n";
            }
        }
    }
    
    sub get_next_event {
        my $fh = shift;
        my $header = shift;
        return unless defined(my $line = <$fh>);
        return unless $line =~ /\S/;
    
        my %event;
        @event{ @$header } = split ' ', $line;
        return \%event;
    }
    
    sub same_event {
        my ($x, $y, $threshold) = @_;
    
        return if $x->{chromosome1} ne $y->{chromosome1};
        return if abs($x->{bp1} - $y->{bp1}) > $threshold;
        return if abs($x->{bp2} - $y->{bp2}) > $threshold;
        return 1;
    }
    
    __DATA__
    source  chromosome1 bp1 chromosome2 bp2
    attempt1    2L  5890205 2L  5890720
    attempt2    2L  5890205 2L  5890721
    attempt1    2L  22220720    2L  22255744
    attempt1    3L  15568694    3L  15568866
    attempt3    3R  14006279    3R  14008254
    attempt1    3R  14006281    3R  14008253
    attempt2    3R  14006282    3R  14008254
    attempt3    3R  14006286    3R  14008254
    attempt1    3R  32060908    3R  32061196
    attempt1    3R  32066206    3R  32068392
    attempt3    3R  32066206    3R  32068392
    attempt2    3R  32066207    3R  32068393
    attempt2    X   4574312 X   4576608
    attempt1    X   4574313 X   4576607
    attempt3    X   4574313 X   4576608
    

Output:

    event   source  chromosome1 bp1 chromosome2 bp2
    1   attempt1    2L  5890205 2L  5890720
    1   attempt2    2L  5890205 2L  5890721
    2   attempt1    2L  22220720    2L  22255744
    3   attempt1    3L  15568694    3L  15568866
    4   attempt3    3R  14006279    3R  14008254
    4   attempt1    3R  14006281    3R  14008253
    4   attempt2    3R  14006282    3R  14008254
    4   attempt3    3R  14006286    3R  14008254
    5   attempt1    3R  32060908    3R  32061196
    6   attempt1    3R  32066206    3R  32068392
    6   attempt3    3R  32066206    3R  32068392
    6   attempt2    3R  32066207    3R  32068393
    7   attempt2    X   4574312 X   4576608
    7   attempt1    X   4574313 X   4576607
    7   attempt3    X   4574313 X   4576608
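
The comment in `run` about changing the `-1` subscript to `0` matters for this data. As a rough illustration, the sketch below groups only the bp1 values of the four 3R rows reported as event 4, ignoring bp2 and the chromosome columns; `group_positions` is a hypothetical helper written for this example and is not part of the script above.

```perl
use strict;
use warnings;

# Hypothetical helper: group an already sorted list of positions, comparing
# each new position to the first ($anchor = 0) or the last ($anchor = -1)
# member of the current group.
sub group_positions {
    my ($anchor, $threshold, @positions) = @_;
    my @groups;
    for my $pos (@positions) {
        if (@groups && abs($pos - $groups[-1][$anchor]) <= $threshold) {
            push @{ $groups[-1] }, $pos;   # close enough: extend current group
        }
        else {
            push @groups, [$pos];          # too far: start a new group
        }
    }
    return @groups;
}

# bp1 values of the four 3R rows reported as event 4
my @bp1 = (14006279, 14006281, 14006282, 14006286);

for my $anchor (-1, 0) {
    my @groups = group_positions($anchor, 5, @bp1);
    printf "anchor %2d: %d group(s): %s\n", $anchor, scalar @groups,
        join ' | ', map { join ',', @$_ } @groups;
}
```

Comparing with the last member keeps all four positions in one group, because each step is at most 4 bp. Comparing with the first member splits 14006286 off into its own group, since it lies 7 bp from 14006279. Which behaviour is wanted depends on whether the +/- 5 window is meant to chain from row to row.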
    

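Because the script above compares each row only with the most recently built class, it appears to rely on the input already being ordered by chromosome and position, as the sample data is. If that cannot be guaranteed, one option is to sort the records first. The following is a minimal sketch under that assumption; the `@records` array and its hashref layout are hypothetical, with values taken from the sample data.

```perl
use strict;
use warnings;

# Hypothetical records in arbitrary order (values from the sample data).
my @records = (
    { source => 'attempt2', chromosome1 => 'X',  bp1 => 4574312,
      chromosome2 => 'X',  bp2 => 4576608 },
    { source => 'attempt1', chromosome1 => '2L', bp1 => 22220720,
      chromosome2 => '2L', bp2 => 22255744 },
    { source => 'attempt2', chromosome1 => '2L', bp1 => 5890205,
      chromosome2 => '2L', bp2 => 5890721 },
    { source => 'attempt1', chromosome1 => '2L', bp1 => 5890205,
      chromosome2 => '2L', bp2 => 5890720 },
);

# Order by chromosome name (string comparison), then bp1 and bp2 (numeric).
my @sorted = sort {
       $a->{chromosome1} cmp $b->{chromosome1}
    || $a->{bp1}         <=> $b->{bp1}
    || $a->{bp2}         <=> $b->{bp2}
} @records;

print join("\t", @{$_}{qw(source chromosome1 bp1 chromosome2 bp2)}), "\n"
    for @sorted;
```

Sorting this way places rows with nearby bp1 values next to each other, but it does not guarantee that every matching pair ends up adjacent, for example when rows with similar bp1 values have very different bp2 values.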
Regarding arrays - grouping hash keys that fall within a numeric range, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/44570474/
