|
|
|
Sponsored Link •
|
|
Advertisement
|
What needs to be done if you want to integrate a new collection class, so that it can profit from all predefined operations at the right types? On the next few pages you'll be walked through two examples that do this.
Integrating sequences
abstract class Base
case object A extends Base
case object T extends Base
case object G extends Base
case object U extends Base
object Base {
val fromInt: Int => Base = Array(A, T, G, U)
val toInt: Base => Int = Map(A -> 0, T -> 1, G -> 2, U -> 3)
}
import collection.IndexedSeqLike
import collection.mutable.{Builder, ArrayBuffer}
import collection.generic.CanBuildFrom
final class RNA1 private (val groups: Array[Int],
val length: Int) extends IndexedSeq[Base] {
import RNA1._
def apply(idx: Int): Base = {
if (idx < 0 || length <= idx)
throw new IndexOutOfBoundsException
Base.fromInt(groups(idx / N) >> (idx % N * S) & M)
}
}
object RNA1 {
// Number of bits necessary to represent group
private val S = 2
// Number of groups that fit in an Int
private val N = 32 / S
// Bitmask to isolate a group
private val M = (1 << S) - 1
def fromSeq(buf: Seq[Base]): RNA1 = {
val groups = new Array[Int]((buf.length + N - 1) / N)
for (i <- 0 until buf.length)
groups(i / N) |= Base.toInt(buf(i)) << (i % N * S)
new RNA1(groups, buf.length)
}
def apply(bases: Base*) = fromSeq(bases)
}
Say you want to create a new sequence type for RNA strands, which are sequences of bases A (adenine), T (thymine), G (guanine), and U (uracil). The definitions for bases are easily set up as shown in Listing 5, RNA Bases.
Every base is defined as a case object that inherits from a common abstract class Base. The Base class has a companion object that defines two functions that map between bases and the integers 0 to 3. You can see in the examples two different ways to use collections to implement these functions. The toInt function is implemented as a Map from Base values to integers. The reverse function, fromInt, is implemented as an array. This makes use of the fact that both maps and arrays are functions because they inherit from the Function1 trait.
The next task is to define a class for strands of RNA. Conceptually, a strand of RNA is simply a Seq[Base]. However, RNA strands can get quite long, so it makes sense to invest some work in a compact representation. Because there are only four bases, a base can be identified with two bits, and you can therefore store sixteen bases as two-bit values in an integer. The idea, then, is to construct a specialized subclass of Seq[Base], which uses this packed representation.
Listing 6, RNA strands, presents the first version of this class. It will be refined later. The class RNA1 has a constructor that takes an array of Ints as its first argument. This array contains the packed RNA data, with sixteen bases in each element, except for the last array element, which might be partially filled. The second argument, length, specifies the total number of bases on the array (and in the sequence). Class RNA1 extends IndexedSeq[Base]. Trait IndexedSeq, which comes from package scala.collection.immutable, defines two abstract methods, length and apply. These need to be implemented in concrete subclasses. Class RNA1 implements length automatically by defining a parametric field of the same name. It implements the indexing method apply with the code given in class RNA1. Essentially, apply first extracts an integer value from the groups array, then extracts the correct two-bit number from that integer using right shift (>>) and mask (&). The private constants S, N, and M come from the RNA1 companion object. S specifies the size of each packet (i.e. two); N specifies the number of two-bit packets per integer; and M is a bit mask that isolates the lowest S bits in a word.
Note that the constructor of class RNA1 is private. This means that clients cannot create RNA1 sequences by calling new, which makes sense, because it hides the representation of RNA1 sequences in terms of packed arrays from the user. If clients cannot see what the representation details of RNA sequences are, it becomes possible to change these representation details at any point in the future without affecting client code. In other words, this design achieves a good decoupling of the interface of RNA sequences and its implementation. However, if constructing an RNA sequence with new is impossible, there must be some other way to create new RNA sequences, else the whole class would be rather useless. In fact there are two alternatives for RNA sequence creation, both provided by the RNA1 companion object. The first way is method fromSeq, which converts a given sequence of bases (i.e., a value of type Seq[Base]) into an instance of class RNA1. The fromSeq method does this by packing all the bases contained in its argument sequence into an array, then calling RNA1's private constructor with that array and the length of the original sequence as arguments. This makes use of the fact that a private constructor of a class is visible in the class's companion object.
The second way to create an RNA1 value is provided by the apply method in the RNA1 object. It takes a variable number of Base arguments and simply forwards them as a sequence to fromSeq. Here are the two creation schemes in action:
scala> val xs = List(A, G, T, A)
xs: List[Product with Base] = List(A, G, T, A)
scala> RNA1.fromSeq(xs)
res1: RNA1 = RNA1(A, G, T, A)
scala> val rna1 = RNA1(A, U, G, G, T)
rna1: RNA1 = RNA1(A, U, G, G, T)
Here are some more interactions with the RNA1 abstraction:
scala> rna1.length
res2: Int = 5
scala> rna1.last
res3: Base = T
scala> rna1.take(3)
res4: IndexedSeq[Base] = Vector(A, U, G)
final class RNA2 private (
val groups: Array[Int],
val length: Int
) extends IndexedSeq[Base] with IndexedSeqLike[Base, RNA2] {
import RNA2._
override def newBuilder: Builder[Base, RNA2] =
new ArrayBuffer[Base] mapResult fromSeq
def apply(idx: Int): Base = // as before
}
Now that you understand why things are the way they are, the next question should be what needs to be done to change them? One way to do this would be to override the take method in class RNA1, maybe like this:
def take(count: Int): RNA1 = RNA1.fromSeq(super.take(count))
To be able to do this, IndexedSeqLike bases itself on the newBuilder abstraction, which creates a builder of the right kind. Subclasses of trait IndexedSeqLike have to override newBuilder to return collections of their own kind. In class RNA2, the newBuilder method returns a builder of type Builder[Base, RNA2].
To construct this builder, it first creates an ArrayBuffer, which itself is a Builder[Base, ArrayBuffer]. It then transforms the ArrayBuffer builder by calling its mapResult method to an RNA2 builder. The mapResult method expects a transformation function from ArrayBuffer to RNA2 as its parameter. The function given is simply RNA2.fromSeq, which converts an arbitrary base sequence to an RNA2 value (recall that an array buffer is a kind of sequence, so RNA2.fromSeq can be applied to it).
If you had left out the newBuilder definition, you would have gotten an error message like the following:
RNA2.scala:5: error: overriding method newBuilder in trait
TraversableLike of type => scala.collection.mutable.Builder[Base,RNA2];
method newBuilder in trait GenericTraversableTemplate of type
=> scala.collection.mutable.Builder[Base,IndexedSeq[Base]] has
incompatible type
class RNA2 private (val groups: Array[Int], val length: Int)
^
one error found
With the refined implementation of the RNA2 class, methods like take, drop, or filter work now as expected:
scala> val rna2 = RNA2(A, U, G, G, T)
rna2: RNA2 = RNA2(A, U, G, G, T)
scala> rna2 take 3
res5: RNA2 = RNA2(A, U, G)
scala> rna2 filter (U !=)
res6: RNA2 = RNA2(A, G, G, T)
However, there is another class of methods in collections that are not dealt with yet. These methods do not always return the collection type exactly. They might return the same kind of collection, but with a different element type. The classical example of this is the map method. If s is a Seq[Int], and f is a function from Int to String, then s.map(f) would return a Seq[String]. So the element type changes between the receiver and the result, but the kind of collection stays the same.
There are a number of other methods that behave like map. For some of them you would expect this (e.g., flatMap, collect), but for others you might not. For instance, the append method, ++, also might return a result of different type as its arguments--appending a list of String to a list of Int would give a list of Any. How should these methods be adapted to RNA strands? Ideally we'd expect that mapping bases to bases over an RNA strand would yield again an RNA strand:
scala> val rna = RNA(A, U, G, G, T)
rna: RNA = RNA(A, U, G, G, T)
scala> rna map { case A => T case b => b }
res7: RNA = RNA(T, U, G, G, T)
scala> rna ++ rna
res8: RNA = RNA(A, U, G, G, T, A, U, G, G, T)
scala> rna map Base.toInt
res2: IndexedSeq[Int] = Vector(0, 3, 2, 2, 1)
scala> rna ++ List("missing", "data")
res3: IndexedSeq[java.lang.Object] =
Vector(A, U, G, G, T, missing, data)
scala> val rna2 = RNA2(A, U, G, G, T)
rna2: RNA2 = RNA2(A, U, G, G, T)
scala> rna2 map { case A => T case b => b }
res0: IndexedSeq[Base] = Vector(T, U, G, G, T)
scala> rna2 ++ rna2
res1: IndexedSeq[Base] = Vector(A, U, G, G, T, A, U, G, G, T)
def map[B, That](f: A => B)
(implicit cbf: CanBuildFrom[Repr, B, That]): That
How is the That type determined? In fact it is linked to the other types by an implicit parameter cbf, of type CanBuildFrom[Repr, B, That]. These CanBuildFrom implicits are defined by the individual collection classes. In essence, an implicit value of type CanBuildFrom[From, Elem, To] says: "Here is a way, given a collection of type From, to build with elements of type Elem a collection of type To."
final class RNA private (val groups: Array[Int], val length: Int)
extends IndexedSeq[Base] with IndexedSeqLike[Base, RNA] {
import RNA._
// Mandatory re-implementation of `newBuilder` in `IndexedSeq`
override protected[this] def newBuilder: Builder[Base, RNA] =
RNA.newBuilder
// Mandatory implementation of `apply` in `IndexedSeq`
def apply(idx: Int): Base = {
if (idx < 0 || length <= idx)
throw new IndexOutOfBoundsException
Base.fromInt(groups(idx / N) >> (idx % N * S) & M)
}
// Optional re-implementation of foreach,
// to make it more efficient.
override def foreach[U](f: Base => U): Unit = {
var i = 0
var b = 0
while (i < length) {
b = if (i % N == 0) groups(i / N) else b >>> S
f(Base.fromInt(b & M))
i += 1
}
}
}
object RNA {
private val S = 2 // number of bits in group
private val M = (1 << S) - 1 // bitmask to isolate a group
private val N = 32 / S // number of groups in an Int
def fromSeq(buf: Seq[Base]): RNA = {
val groups = new Array[Int]((buf.length + N - 1) / N)
for (i <- 0 until buf.length)
groups(i / N) |= Base.toInt(buf(i)) << (i % N * S)
new RNA(groups, buf.length)
}
def apply(bases: Base*) = fromSeq(bases)
def newBuilder: Builder[Base, RNA] =
new ArrayBuffer mapResult fromSeq
implicit def canBuildFrom: CanBuildFrom[RNA, Base, RNA] =
new CanBuildFrom[RNA, Base, RNA] {
def apply(): Builder[Base, RNA] = newBuilder
def apply(from: RNA): Builder[Base, RNA] = newBuilder
}
}
Now the behavior of map and ++ on RNA2 sequences becomes clearer. There is no CanBuildFrom instance that creates RNA2 sequences, so the next best available CanBuildFrom was found in the companion object of the inherited trait IndexedSeq. That implicit creates IndexedSeqs, and that's what you saw when applying map to rna2.
To address this shortcoming, you need to define an implicit instance of CanBuildFrom in the companion object of the RNA class. That instance should have type CanBuildFrom[RNA, Base, RNA]. Hence, this instance states that, given an RNA strand and a new element type Base, you can build another collection which is again an RNA strand. Listings 8 and 9 show the details. Compared to class RNA2, there are two important differences. First, the newBuilder implementation has moved from the RNA class to its companion object. The newBuilder method in class RNA simply forwards to this definition. Second, there is now an implicit CanBuildFrom value in object RNA. To create such an object you need to define two apply methods in the CanBuildFrom trait. Both create a new builder for an RNA collection, but they differ in their argument list. The apply() method simply creates a new builder of the right type. By contrast, the apply(from) method takes the original collection as argument. This can be useful to adapt the dynamic type of builder's return type to be the same as the dynamic type of the receiver. In the case of RNA this does not come into play because RNA is a final class, so any receiver of static type RNA also has RNA as its dynamic type. That's why apply(from) also simply calls newBuilder, ignoring its argument.
That is it. The final RNA class implements all collection methods at their natural types. Its implementation requires a little bit of protocol. In essence, you need to know where to put the newBuilder factories and the canBuildFrom implicits. On the plus side, with relatively little code you get a large number of methods automatically defined. Also, if you don't intend to do bulk operations like take, drop, map, or ++ on your collection you can choose to not go the extra length and stop at the implementation shown in for class RNA1.
The discussion so far centered on the minimal amount of definitions needed to define new sequences with methods that obey certain types. But in practice you might also want to add new functionality to your sequences or to override existing methods for better efficiency. An example of this is the overridden foreach method in class RNA. foreach is an important method in its own right because it implements loops over collections. Furthermore, many other collection methods are implemented in terms of foreach. So it makes sense to invest some effort optimizing the method's implementation. The standard implementation of foreach in IndexedSeq will simply select every i'th element of the collection using apply, where i ranges from 0 to the collection's length minus one. So this standard implementation selects an array element and unpacks a base from it once for every element in an RNA strand. The overriding foreach in class RNA is smarter than that. For every selected array element it immediately applies the given function to all bases contained in it. So the effort for array selection and bit unpacking is much reduced.
|
Sponsored Links
|