Feature Fusion For The Uninitiated

Feature Fusion

Consider a typical e-commerce product. It has a variety of content-specific features like product title, brand, and thumbnail, and other engagement-driven features like number of clicks and click-through rate. Any machine learning model ingesting features of this product (e.g. a product ranker or recommendation model) has to deal with the problem of merging these distinct features. Broadly speaking, we can divide these features into the following categories:

  • Dense Features
    • Counter: # of times clicked in the last k days/hours
    • Ratio: CTR of the entity
    • Float: price
  • Sparse Features
    • Category Id
    • Listing ids: [ids of the last k products clicked]
  • Rich Features
    • Title Text Embedding
    • Category/Attributes Embedding

In a neural network, once we step beyond the input layer, a common problem is fusing (merging) these features into a unified embedding of the entity (an e-commerce product in our case). This unified embedding is then further transformed in the higher layers of the network. Given the variety of features (float-valued features, embedding-based features, categorical features, etc.), model performance depends on when and how we process them.

In the rest of the post we discuss various feature fusion techniques.

Embedding Feature Fusion Techniques

For the sake of simplicity, we first consider how to handle multiple embeddings in the input feature layer; later we will include architectural techniques to merge dense (float-valued) features. As an example, if we are building a music recommendation model, one of the inputs to the recommender may be the embeddings of the last k songs the user listened to, or the embeddings of their genres. We would want to fuse (merge) these embeddings into a single representation for this feature. The next sections discuss techniques to combine these embeddings in the input layer and generate a unified embedding.

Method 1 – Concatenate And Transform

In the diagram below, the input layer can have text, images, and other types of features. A respective encoder processes the corresponding input and generates an embedding (e.g. a pre-trained ResNet can generate the image embedding, a pre-trained BERT the text embedding). These embeddings are represented as vector-1, vector-2, and vector-3.

In concatenation-based fusion we concatenate the embeddings belonging to each modality and then pass the result through a fully connected network.
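As a rough sketch (NumPy only, with random placeholder weights standing in for learned parameters, and made-up embedding sizes), concatenate-and-transform looks like:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-modality embeddings (vector-1/2/3 in the figure).
text_vec = rng.normal(size=64)    # e.g. from a pre-trained BERT encoder
image_vec = rng.normal(size=128)  # e.g. from a pre-trained ResNet encoder
other_vec = rng.normal(size=32)

# Step 1: concatenate along the feature axis -> one long vector.
fused_input = np.concatenate([text_vec, image_vec, other_vec])  # shape (224,)

# Step 2: a fully connected layer with ReLU as the "transform".
# W and b would normally be learned; here they are random placeholders.
W = rng.normal(size=(96, fused_input.shape[0])) * 0.01
b = np.zeros(96)
unified_embedding = np.maximum(W @ fused_input + b, 0.0)  # shape (96,)
```

A deeper fully connected stack works the same way, each layer consuming the previous one's output.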

Method 2 – Pooling

Another lightweight technique is pooling, e.g. mean or max pooling. In the image below, the YouTube session recommender fuses the embeddings of videos watched in a session by averaging them.
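A minimal NumPy sketch of mean and max pooling over a set of hypothetical session embeddings:

```python
import numpy as np

# Hypothetical embeddings of the last k watched videos in a session (k=5, d=64).
session = np.random.default_rng(1).normal(size=(5, 64))

mean_pooled = session.mean(axis=0)  # average pooling, as in the YouTube example
max_pooled = session.max(axis=0)    # max pooling alternative

# Both reduce a variable-length set of embeddings to one fixed-size vector.
```

The appeal is that the output size is independent of k, so sessions of any length map to the same representation shape.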

Method 3 – Pair Wise Dot Product

In this method we take the pairwise dot product of every pair of embeddings (the vectors in the image below). After the pairwise dot products, we use MLP (fully connected) layers to transform the result.
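A small NumPy sketch of pairwise dot-product fusion (random embeddings and placeholder MLP weights; the dimensions are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 4, 16                       # 4 feature embeddings of dimension 16
X = rng.normal(size=(n, d))

# All pairwise dot products at once: entry (i, j) = x_i . x_j.
dots = X @ X.T                     # shape (n, n), symmetric

# Keep only the upper triangle (distinct pairs) and flatten for the MLP.
iu = np.triu_indices(n, k=1)
pairwise_features = dots[iu]       # n*(n-1)/2 = 6 values

# A fully connected layer then transforms these interactions.
W = rng.normal(size=(8, pairwise_features.shape[0])) * 0.1
hidden = np.maximum(W @ pairwise_features, 0.0)
```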

Pair Wise Dot Product

One drawback of this technique is its computational complexity: the number of dot products grows quadratically with the number of embedding vectors. As the input layer grows, the connection of the dot products to the fully connected layer can become an efficiency bottleneck and can constrain model capacity.

To overcome this issue we can resort to the compressed dot product architecture discussed in the next section.

Method 4 – Compressed Dot Product Architecture

Dot Product Of Embeddings As Matrix Multiplication

In the above figure we have shown pairwise dot-product fusion as a matrix multiplication. Consider X as a matrix of feature embeddings; here we have 100 features, and each feature embedding is of length 64. The pairwise dot products can be viewed as the product of the feature matrix and its transpose.

The compression technique exploits the fact that the dot product matrix XX^T has rank at most d when d <= n, where d is the dimensionality of the embedding vectors and n is the number of features.

Thus, XX^T is a low-rank matrix that has O(n*d) degrees of freedom rather than O(n^2). In other words, XX^T is compressible. This is true for many current model types that have sparse input data, and it allows compression of the dot product without loss of information.

  • X is an n*d matrix of n d-dimensional embedding vectors (n=100, d=64).
  • Instead of calculating flatten(XX^T), we calculate flatten(XX^TY).
  • Y is an n*k matrix (in the example k=15); it is a learnable parameter and is initialized randomly.

Computing XX^T takes n * d * n operations, whereas the compressed product XX^TY, evaluated right-to-left as X(X^TY), takes O(n * d * k) operations (k is much smaller than n).

Y is learned alongside the other parameters of the model. The projection through Y significantly reduces the number of neurons passed to the next layer while still preserving the information of the original dot product matrix, by exploiting its low-rank property.
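A NumPy sketch of the compression, using the n=100, d=64, k=15 figures from above (X and Y are random here; in a real model Y is learned):

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, k = 100, 64, 15              # figures from the example above
X = rng.normal(size=(n, d))        # n feature embeddings of dimension d
Y = rng.normal(size=(n, k)) * 0.1  # learnable projection, random init

# Uncompressed: flatten(X X^T) has n*n = 10,000 values.
full = (X @ X.T).reshape(-1)

# Compressed: compute X (X^T Y) right-to-left -> O(n*d*k) operations,
# and flatten(X X^T Y) has only n*k = 1,500 values.
compressed = (X @ (X.T @ Y)).reshape(-1)
```

Note the bracketing: X @ (X.T @ Y) never materializes the n*n matrix, which is where the efficiency comes from.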

One drawback of this approach is that the learned matrix Y is independent of the input: it has the same values for all inputs. To resolve this issue, we can use attention-aware compression (please refer to Dot Product Matrix Compression for Machine Learning for more details).

Method 5 – Attention Fusion

In this approach we follow the self-attention logic to fuse the different embeddings.
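A minimal single-head self-attention fusion sketch in NumPy (Wq, Wk, Wv are random placeholders here; in practice they are learned, and the final pooling step is one of several options):

```python
import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(4)
n, d = 4, 32                        # 4 embeddings to fuse, dimension 32
E = rng.normal(size=(n, d))

# Learnable query/key/value projections (random placeholders).
Wq, Wk, Wv = (rng.normal(size=(d, d)) * 0.05 for _ in range(3))
Q, K, V = E @ Wq, E @ Wk, E @ Wv

# Scaled dot-product self-attention across the n embeddings.
attn = softmax(Q @ K.T / np.sqrt(d), axis=-1)   # (n, n) attention weights
fused = attn @ V                                 # (n, d) contextualized vectors

# One option: mean-pool the attended outputs into a single unified vector.
unified = fused.mean(axis=0)                     # (d,)
```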

Method 6 – Tree based Fusion

In this technique we concatenate the feature embeddings and provide them as a single input to a tree ensemble model, e.g. boosted trees (GBT). This model is trained separately from the main neural network. We then take the output of the leaves of each tree in the ensemble, depicted as h1, h2, etc. in the image below. The fused (transformed) embedding is the concatenation of the leaf outputs (h1 + h2 + ..), which is provided as input to the main neural network. At a high level this acts as a non-linear transform of the input features.
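As a toy illustration (hand-rolled one-split "trees" stand in for a trained GBT ensemble; a real pipeline would take leaf indices from a trained boosted-tree library and one-hot them the same way):

```python
import numpy as np

def leaf_one_hot(x, threshold, n_leaves=2):
    """Toy one-split 'tree': return a one-hot vector over its leaves."""
    leaf = 0 if x < threshold else 1
    h = np.zeros(n_leaves)
    h[leaf] = 1.0
    return h

# Concatenated feature vector for one product (hypothetical values).
features = np.array([12.5, 0.03])   # e.g. price, CTR

# Two toy "trees" in the ensemble, each splitting on one feature.
h1 = leaf_one_hot(features[0], threshold=10.0)   # price tree
h2 = leaf_one_hot(features[1], threshold=0.05)   # CTR tree

# Fused embedding = concatenation of per-tree leaf outputs (h1 + h2 + ..),
# which is then fed to the main neural network.
fused = np.concatenate([h1, h2])
```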

Dense (Float) Features Fusion

A special case of feature fusion concerns float-valued features. Unlike embedding-based feature fusion, here we have just one vector of float features (e.g. product price, click-through rate, add-to-cart rate). To handle dense features we pass them through a fully connected layer with a non-linearity. This yields a transformed representation that is either concatenated with, or interacted with, the fused sparse features.
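A minimal sketch of this dense tower (placeholder weights, made-up feature values):

```python
import numpy as np

rng = np.random.default_rng(5)

# One vector of float features: price, CTR, add-to-cart rate (hypothetical).
dense = np.array([19.99, 0.042, 0.011])

# Fully connected layer + ReLU non-linearity (weights would be learned).
W = rng.normal(size=(16, dense.shape[0])) * 0.1
b = np.zeros(16)
dense_repr = np.maximum(W @ dense + b, 0.0)   # transformed dense representation
```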

Baseline Model

An example architecture for a baseline model is described below.

In this architecture we first perform fusion of the dense features. Separately, the sparse (embedding) features go through dot-product fusion in their own layer. Next, we concatenate the fused dense and sparse vectors in the interaction arch. The remainder of the network performs transformations over the concatenated feature vector.

Dense Sparse Feature Fusion

The last piece of the puzzle is how to interact the dense and sparse features. So far we have seen embedding-based feature fusion (say, via dot-product fusion) and dense (float-valued) feature fusion separately. In this section we explore how to interact (fuse) them together.

For dense-sparse interactions we first generate an intermediate representation of the dense features by applying a non-linear transform. Then we perform the following two steps:

  • Concatenating transformed dense feature representation to interaction layer
  • Stacking the transformed dense feature representation with embedding inputs

In this way the transformed dense representation (the output of the FC + ReLU layer) takes part in dot-product fusion, and hence interacts with the individual embeddings of the input layer. Since dense features carry bias information (e.g. CTR or number of clicks signal product performance), we want to preserve it; hence, in the manner of a residual connection, we concatenate the transformed dense representation with the output of the dot-product fusion.
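Putting the two steps together in a NumPy sketch (all weights and embeddings are random placeholders, dimensions illustrative):

```python
import numpy as np

rng = np.random.default_rng(6)
d = 16

# Sparse-side embeddings (e.g. category id, last-clicked listings), n=3.
E = rng.normal(size=(3, d))

# Dense features -> FC + ReLU so they match the embedding dimension d.
dense = np.array([19.99, 0.042, 0.011])
W = rng.normal(size=(d, dense.shape[0])) * 0.1
dense_repr = np.maximum(W @ dense, 0.0)        # (d,)

# Stack the dense representation with the embeddings, so it participates
# in pairwise dot-product fusion like any other embedding.
stacked = np.vstack([E, dense_repr])           # (4, d)
iu = np.triu_indices(stacked.shape[0], k=1)
interactions = (stacked @ stacked.T)[iu]       # 6 pairwise dot products

# Residual-style: also concatenate dense_repr with the fused interactions.
fused = np.concatenate([interactions, dense_repr])
```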


Graph Neural Networks Based Attribute Discovery For E-Commerce Taxonomy Expansion

Previous post on Attribute Discovery

In Part 1 of Attribute Discovery we discussed unsupervised approaches that used graph-based keyword and key-phrase extraction algorithms to generate a list of candidate tokens, i.e. potential attributes missing from the e-commerce taxonomy. We also talked about filtering heuristics like statistical and syntactic pruning. This article goes one step further, digging into supervised modeling methods to further remove noisy candidates.

As discussed in the last part of this series, one of the foremost drawbacks of a totally unsupervised keyword extraction method is that we don't learn from our mistakes once a taxonomist or human labeler gives feedback on the extracted keywords. Another issue is that without a model-based approach we miss out on techniques like active learning, which can guide us in creating better training data and hence a better feedback loop for getting contentious data points labeled. Unsupervised keyword extraction techniques are also not domain sensitive, i.e. they cannot be tuned to the data domain or topic, and they emit many generic keywords that are unimportant for the domain under consideration. Finally, the sheer number of candidates generated is hard to get labeled, so we definitely need a mechanism to rank them in order of importance and get feedback from domain experts.

We are also not exploiting the information at hand, i.e. the attributes that are already part of the taxonomy. For example, if we already know that for the category Clothing>Kids>Shoes the attribute type Color has values Green, Black, and White in our taxonomy, and the candidate generator emits new candidates Brown and Blue, we should be able to use the data points already available to us (Green, Black, and White) to associate the new keywords (Brown and Blue) with the attribute type Color.

After extracting important keywords and candidate brand tokens, we would like to answer the following questions:

  1. Do the new keywords belong to existing attribute types? — Attribute Type Detection
  2. If not, can we group similar keywords together so that each group may correspond to a new attribute type? — Attribute Type Discovery

To check whether a given keyword belongs to an existing attribute type, we try link prediction[1] and node classification[2] based analyses using graph convolutional networks (GCNs)[3].

Reference: https://tkipf.github.io/graph-convolutional-networks/

In the message passing methodology, in each iteration every node receives messages from its neighboring nodes, and a new representation is generated from their representations (the messages) and its own. This involves applying aggregation functions and non-linear transformations.

Reference: Graph Convolutional Neural Networks for Web-Scale Recommender Systems

In the above figure, node A receives the representations of its neighboring nodes B, C, and D; after applying linear transformations, a permutation-invariant aggregation operator (sum, average, etc.), and a further non-linear transform, a new representation for node A is generated.
These representations are then used for a downstream task like link prediction, node classification, or graph classification.
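A single GCN-style message-passing layer can be sketched in NumPy as follows (toy 4-node star graph matching the figure; H and W are random placeholders, and the symmetric normalization follows the Kipf & Welling formulation):

```python
import numpy as np

rng = np.random.default_rng(7)

# Toy graph from the figure: A(0) connected to B(1), C(2), D(3).
A = np.array([[0, 1, 1, 1],
              [1, 0, 0, 0],
              [1, 0, 0, 0],
              [1, 0, 0, 0]], dtype=float)
A_hat = A + np.eye(4)                      # add self-loops

# Symmetric normalization D^{-1/2} (A + I) D^{-1/2}.
deg = A_hat.sum(axis=1)
D_inv_sqrt = np.diag(deg ** -0.5)
A_norm = D_inv_sqrt @ A_hat @ D_inv_sqrt

H = rng.normal(size=(4, 8))                # initial node representations
W = rng.normal(size=(8, 8)) * 0.1          # learnable layer weights

# One message-passing layer: aggregate neighbors, transform, non-linearity.
H_next = np.maximum(A_norm @ H @ W, 0.0)   # node A's new row mixes B, C, D
```

Stacking two such layers gives each node information from its 2-hop neighborhood, which is what the experiments below rely on.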

Link Prediction[1]

Data :

  • Keywords extracted from product titles and descriptions of products in Baby & Kids//Strollers & Accessories//Strollers
  • Known attribute types and their corresponding values in the category Baby & Kids//Strollers & Accessories//Strollers, i.e. data from the taxonomy

Heterogeneous Graph Creation

In this approach we first create a graph in which the nodes are the extracted keywords, with a link between two keywords if and only if they occurred alongside each other in product title and description text more than a certain number of times. Furthermore, we add extra nodes representing the attribute types, e.g. a Brand node, a Stroller Type node, etc. Additional links are created between the attribute-type nodes and their synonyms from the rules table. To make the graph denser we added a "pseudo" (dummy) node for the category name "stroller". This node acts as a context node and provides paths between existing attribute-value nodes (as well as attribute-type nodes) and new attribute-value candidates (keyword nodes).

Types of Edges

  • Between the Brand node and known brand values
  • Between the Stroller Type node and known stroller-type values
  • Between the dummy node "stroller" and keywords occurring in its neighborhood in product titles and descriptions
  • Between candidate keywords and the rest of the nodes occurring in their neighborhood in product titles and descriptions

Objective: given two nodes, predict whether an edge exists between them. For our use case this means: given an attribute-type node (e.g. Brand or Stroller Type) and a candidate keyword, predict whether a link exists between them.

Node Features:
Each node has a 100-dimensional feature vector as an attribute. These vectors are word embeddings generated using FastText[5].

Category : Baby & Kids//Strollers & Accessories//Strollers

Train/Test Node Tuples
After creating the graph we split the links into train and test datasets. Positive examples are node tuples with an edge in the graph; for negative examples we created synthetic hard negatives, e.g. tuples sampled from brand nodes and stroller-type attribute values.

Model Architecture

We used TensorFlow-based StellarGraph[4] to create a network with two GCN layers and a final binary classification layer. The hidden-layer representations of both input nodes are aggregated using a pointwise product, and the resulting "edge" representation is forwarded to the binary classification layer. The Adam optimizer was used to compute gradients for backpropagation.

Input: word embeddings of the tuple of nodes forming an edge, e.g. if an edge e is formed by the vertices "Brand" and "graco", then the word embeddings of these two nodes are the input to the model.

Link Prediction Results

The plots below show the link prediction probabilities on the train and test datasets.

Attribute Type Prediction And Issues
In the last phase we input node-id tuples of newly extracted keywords and the Stroller Type attribute node to check whether a link exists between them; similarly we input (brand candidate, Brand node) tuples to check for links to the Brand node. The results of this experiment were poor; the model appears to be overfit. There is also a semantic issue with the links in the graph: one half of the edges represent keyword proximity in product titles and descriptions, whereas the other half represent connections between attribute types and their synonyms. This discrepancy in the meaning of the edges may be causing the low link prediction accuracy on the new dataset.

Node Classification[2]

In this second experiment we intend to classify the nodes in the keyword graph into their corresponding attribute types.

Category : Toys & Games//Remote Control Toys//Cars & Trucks

  1. Data instances: ~16k product titles and descriptions
  2. Attribute values and attribute types in the taxonomy for the category Cars & Trucks, with the attribute types serving as the classes of the attribute values

Node Features

Each node is represented by a 50-dimensional word embedding generated with FastText, concatenated with a 5-dimensional Poincaré embedding of the keyword graph. The Poincaré embedding captures the hierarchical (positional) details of the node in the graph.

Homogeneous Graph Creation

In this graph each node has an optional attribute representing its class, e.g. Brand or another attribute type. The graph is formed based on keyword proximity in product title and description text. No additional attribute-type nodes are added to the graph as was done in the link prediction task.

As in the earlier experiment, we added additional nodes representing known attribute values, e.g. known brands and vehicle types. To make the graph sufficiently dense, we again added dummy nodes for "car" and "truck" (the tokens in the category name). Edges exist only when two tokens occur together in a product title or description.

Model Architecture

Visual Analysis

To get visual intuition for the effect of training on the node representations, we compare 2-D t-SNE projections of the node embeddings before and after training.

Before Training : Train Node Feature Projection In 2D using TSNE

After Training : Train Node Feature Projection In 2D using TSNE

We generate a node's representation by feeding the network the node's feature vector (FastText word embedding concatenated with the node's Poincaré embedding) and taking the output of the second GCN layer (the second hidden layer). We can clearly see that, in 2-D, nodes belonging to the same attribute type are closer to each other after training the network.

Testing the trained model on newly extracted keywords

In a similar fashion we generate representations of the newly extracted keywords. The plots below show t-SNE visualizations of the original feature vectors of the extracted keywords vs. the representations generated by the trained network.

Plot original feature vectors of keywords with unknown attribute type ( TSNE)

There seems to be no apparent structure in the above plot.

Plot GCN transformed representations of keywords with unknown attribute type (TSNE)

The above plot shows 2 clusters in data, one near the origin and second on the other side of the diagonal.

Node Classification Results

We had 59 keywords whose attribute types were unknown.
Threshold Creation
In this step we run the training data through the network and collect the maximum predicted probabilities, i.e. for each node whose attribute type is known, we classify it with the trained model and take the maximum of the softmax layer.
We then find the 90th percentile of these probabilities, which is used as the classification threshold on data where attribute types are unknown.
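The threshold computation itself is straightforward; a sketch with made-up training probabilities:

```python
import numpy as np

# Hypothetical max softmax probabilities from classifying the training nodes.
train_max_probs = np.array([0.95, 0.88, 0.97, 0.60, 0.99, 0.91, 0.85,
                            0.93, 0.70, 0.98])

# Classification threshold = 90th percentile of these maxima.
threshold = np.percentile(train_max_probs, 90)

# A new keyword's prediction is accepted only above the threshold.
new_keyword_prob = 0.92
accepted = new_keyword_prob > threshold
```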

Node classification maximum class probability on training data

In the last step the extracted keywords are fed into the network, and the maximum of the softmax layer is compared with the threshold computed above. We consider the predicted class only when the predicted probability exceeds the calculated threshold.

Output :
keyword: run, predicted attribute type: brand
keyword: jeremy, predicted attribute type: rc scale
Both appear to be misclassifications.

Hierarchical Clustering the transformed representations of new keywords

After training the model on all the labeled nodes, we generated representations for all nodes without a class, i.e. nodes whose attribute type we don't know. We then performed hierarchical clustering on these representations.

Three clusters seem to exist in our data. Their contents are as follows:

Cluster 1 = ['kit', 'scale', 'bodied', 'diecast', 'toy', 'red', 'redcat',
'jeremy', 'mayfield']
Cluster 2 = ['rc', 'nitro', 'work', 'model', 'brand', 'custom', 'light', 'rear',
'seat', 'speed', 'ship', 'ford', 'barbie', 'power', 'like', 'race',
'pickup', 'condit', 'baja', 'vxl', 'talion', 'partes', 'black',
'lamborghini', 'look', 'spiderman', 'indy', 'conteol', 'lowrider']

Cluster 3 = ['charger', 'good', 'need', 'box', 'control', 'vw', 'batterie',
'come', 'remot', 'price', 'wheel', 'ride', 'drive', 'atv', 'kid',
'jeep', 'run', 'include', 'tire', 'upgrade', 'motor']

The intuition behind this experiment was that after GCN transformations newly discovered keywords that are similar to each other would group together in one cluster and that cluster may be represented by an attribute type.


We discussed various approaches that don't solve the problem of attribute extraction individually, but as a whole, a pipeline comprising the above steps as individual components can provide an unsupervised way to generate candidate attribute values and brands that can be further pruned by taxonomists. This approach can save a lot of time by automating the attribute discovery phase, providing not only important keywords/brands but also synonyms present in the data that can be used to increase coverage. The proposed tool neatly lays out all the relevant information about every category that the taxonomist needs, in the form of keywords, brands, and attributes.


  1. Link Prediction Based on Graph Neural Networks
  2. Node Classification
  3. Semi-Supervised Classification with Graph Convolutional Networks
  4. https://stellargraph.readthedocs.io/en/stable/demos/link-prediction/
  5. Enriching Word Vectors with Subword Information

Attribute Discovery For E-Commerce Taxonomy Expansion – Part 1 Unsupervised Graph Based Keyword Extraction

Category And Attribute Taxonomy

During my time at Facebook Marketplace I worked on a rather esoteric problem: semi-automating attribute discovery, i.e. finding granular attribute values in product titles and descriptions that are not present in the product attribute taxonomy. Each category in an e-commerce catalog has a unique set of attribute types and values, e.g. Clothing > Men's > Shirt can have the attribute types Color, Size, Pattern, and Brand, and each attribute type has a given set of values (Size can be XS, S, M, L, XL, etc.). These attributes play a critical role in query understanding models and act as critical features in retrieval and ranking models. So the richer the attribute taxonomy, the better the customer experience.

At most e-commerce companies the current process for finding these attributes involves taxonomists manually analyzing our data and other data sets. The existing approach is top-down: for each category, taxonomists do competitor analysis and create attribute types and corresponding values relevant to the concerned category.

Some of the drawbacks of manual attribute creation are

  • Scalability issues: hard to keep up with the creation of attribute types and values as more and more products are added to the catalog
  • Low coverage: there is no way to discover attribute values present in our data that correspond to an attribute type, which often leads to low coverage
  • Rules/synonym creation: synonyms for attribute values are manually curated, and in the existing pipeline coverage is directly proportional to the number of synonyms an attribute value has

The rest of the document discusses approaches that can be used to automatically discover attribute values and to group similar attribute values that may represent a potential attribute type. Furthermore, we try to extract all relevant synonyms for the discovered attributes. This is a bottom-up approach: we look at the data first and then come up with a set of attributes.

Attribute Discovery Process

The proposed attribute discovery approach is unsupervised. The procedure broadly involves hierarchically clustering the data and extracting relevant keywords from each cluster. In the next step we further prune and cluster keywords that are similar to each other. The aim is to present clusters of high-quality keywords from different categories to taxonomists and thereby save their time and effort. By high quality we mean keywords that are not only statistically relevant but also define some property or quality of the core entity of the category, e.g. for the category Baby & Kids > Diapering > Cloth Diapers we would want to automatically create a list of keywords [small, large, prefold, resusable, snap, pocket, inserts, covers, lot, size, liner, organic, bamboo] that define properties of diapers, and a second list of probable brands associated with cloth diapers, e.g. Bumgenius, Thirsties, Grovia, Alva, Panales, Tela. In the final step a taxonomist can prune or select the proposed keywords.

PART 1 Analysis

Step 1: Product Representation Creation

For experimentation's sake we considered only the product title and description features to create the product representation. We used a pre-trained Hugging Face sentence transformer to generate product embeddings of length 768. Better representations could be created using other product features like images, prices, etc.

Step 2: Hierarchical Clustering

In this step we performed Ward hierarchical clustering[1] on the 905 x 768 matrix created in the last step.

The above dendrogram plot indicates that there are 4 clusters in the Cloth Diapers category. In Semi-supervised Class Discovery[2] and PinnerSage: Multi-Modal User Embedding Framework For Recommendations[3], researchers at Google and Pinterest respectively used similar hierarchical clustering approaches to group similar products for further analysis.
Later on we would want to automate the detection of the optimal number of clusters.

A quick look at the word clouds of the 1st and 2nd clusters provides a glimpse of the different prominent keywords present in the corresponding clusters.

Cluster 1

Cluster 2

Step 3: TextRank: Keyword Extraction

This step involves extracting the most relevant keywords from the clusters created in the last step. We use TextRank[4], a graph-based ranking model for text processing. TextRank creates a graph from the tokens present in product titles and descriptions and, after some preprocessing, uses a scoring model to rank the nodes in the graph. TextRank provides better results than statistical and lexical analysis based approaches. We also explored other methods like EmbedRank[5], which uses embedding-based similarity between sentences and the whole corpus, but found that TextRank best suits our use case.
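The core of TextRank is PageRank over a token co-occurrence graph; a stripped-down NumPy sketch on a toy graph (real TextRank adds POS filtering, windowed co-occurrence, and phrase merging, and the tokens here are made up):

```python
import numpy as np

# Toy co-occurrence graph over tokens from hypothetical diaper titles:
# edges between tokens that appear near each other in the text.
tokens = ["cloth", "diaper", "pocket", "insert", "snap"]
edges = [(0, 1), (1, 2), (1, 3), (2, 3), (1, 4)]

n = len(tokens)
A = np.zeros((n, n))
for i, j in edges:                 # undirected co-occurrence graph
    A[i, j] = A[j, i] = 1.0

# Row-normalize to get transition probabilities, then run PageRank
# (the scoring model at the heart of TextRank) by power iteration.
M = A / A.sum(axis=1, keepdims=True)
scores = np.full(n, 1.0 / n)
for _ in range(50):
    scores = 0.15 / n + 0.85 * (M.T @ scores)

ranked = [tokens[i] for i in np.argsort(-scores)]
# "diaper" has the most connections, so it should rank first.
```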

The list of the most relevant keywords from the first cluster (after applying stemming and grouping similar tokens):

('insert', ['insert', 'inserts', 'inserted']),
('size', ['size', 'sized', 'sizes']),
('cover', ['covers', 'cover']),
('brand', ['brand', 'brands']),
('lot', ['lot', 'lots']),
('newborn', ['newborn', 'newborns']),
('pocket', ['pocket', 'pockets']),
('small', ['small']),
('new', ['new']),
('prefold', ['prefolds', 'prefold']),
('snap', ['snaps', 'snap']),
('bag', ['bag', 'bags']),
('bumgenius', ['bumgenius']),
('thirsties', ['thirsties']),
('liner', ['liners', 'liner']),
('included', ['including', 'includes', 'included']),
('bum', ['bum', 'bums']),
('bamboo', ['bamboo']),
('white', ['white']),
('reusable', ['reusable']),
('organic', ['organic']),
('velcro', ['velcro']),
('lbs', ['lbs']),
('large', ['large'])

The above mapping provides a candidate list of attribute values and their corresponding synonyms.

Step 4: Semantic Association Between Important Keywords : Network Based Analysis

After creating an initial list of candidate attribute values, the intuitive next step is to group them so that each group can be defined by an attribute type, and to further prune the candidate list down to the most relevant keywords for the category "Cloth Diapers".

Example: In the above list “pocket”, “prefold”, “snap”, “size” defines properties of cloth diapers but “velcro” is a property of “pocket”.

In this step we create a graph of the tokens extracted in the last step based on their proximity in product titles and descriptions. In this experiment we create an edge between two nodes only if the corresponding keywords are within a window size of 1, i.e. adjacent to each other. We use the Python library networkx[6] to create a directed acyclic graph (DAG) of keywords.
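A minimal pure-Python sketch of the window-size-1 graph construction (toy titles; the real pipeline builds the same adjacency with networkx):

```python
# Build a directed graph of adjacent keywords (window size 1) from titles.
titles = [
    "cloth diaper pocket velcro",
    "cloth diaper prefold snap",
    "pocket velcro white liner",
]

# Keywords surviving the extraction step (hypothetical subset).
keywords = {"cloth", "diaper", "pocket", "velcro", "prefold", "snap",
            "white", "liner"}

edges = set()
for title in titles:
    toks = title.split()
    for a, b in zip(toks, toks[1:]):          # adjacent pairs only
        if a in keywords and b in keywords and a != b:
            edges.add((a, b))                  # directed edge a -> b
```

For example, the edge ("pocket", "velcro") appears here, mirroring the velcro-is-an-aspect-of-pocket relation discussed below.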

The above DAG clearly shows some interesting properties of our attributes.

  1. The red edges show direct associations of "cloth diapers", i.e. core attributes of cloth diapers.
  2. The connection between the Velcro and Pocket nodes shows that velcro is an aspect of pocket diapers.
  3. Similarly, the color "white" is an attribute of the "liner" and "velcro" nodes.
  4. The two directed edges between Brand and Thirsties create a strong association between them, implying that Thirsties is a brand.

In a similar manner we can make more inferences.

Step 5: Hierarchical Embeddings: Creating Hierarchical Relations Between Nodes

Once we have a DAG of important keywords, a natural next step is to understand the hierarchy among the nodes: we want to know the core nodes forming the backbone of the network and their associated satellite nodes.
Some approaches that did not result in meaningful node associations:

  1. PageRank: apart from producing high-ranking nodes, PageRank didn't provide any semantic information from the graph structure itself.
  2. Generating word embeddings with FastText[7] and using the embeddings of the extracted keywords for hierarchical clustering.

In this step we generated Poincaré embeddings[8] of the nodes of the graph created in the last step.

 "Poincaré embeddings are a method to learn vector representations of nodes in a graph. The input data is of the form of a list of relations (edges) between nodes, and the model tries to learn representations such that the vectors for the nodes accurately represent the distances between them. The learnt embeddings capture notions of both hierarchy and similarity - similarity by placing connected nodes close to each other and unconnected nodes far from each other; hierarchy by placing nodes lower in the hierarchy farther from the origin, i.e. with higher norms. 
The main innovation here is that these embeddings are learnt in hyperbolic space, as opposed to the commonly used Euclidean space. The reason behind this is that hyperbolic space is more suitable for capturing any hierarchical information inherently present in the graph. Embedding nodes into a Euclidean space while preserving the distance between the nodes usually requires a very high number of dimensions."[9]
Poincare Hierarchy

We generated 2-D representations of the nodes using gensim.models.poincare[10]. Nodes near the origin represent the root nodes of the graph. The node "cloth diaper" is at coordinate (0, 0), suggesting that it is the root node (key topic) of the graph. Nodes at the periphery, like lbs, bum, and new, are leaf nodes far down in the hierarchy, i.e. not directly relevant to the root node "cloth diaper".

To fetch the key attributes of cloth diapers we extract the nearest neighbors of the root node. The list below shows the nearest neighbors with their corresponding distances.

[('cloth diaper', 0.0),
 ('small', 0.1365535033774657),
 ('large', 0.17949012971429448),
 ('prefolds', 0.20087067143255224),
 ('reusable', 0.22612112692088746),
 ('inserts', 0.29712060251440736),
 ('bumgenius', 0.30584862418124015),
 ('covers', 0.3563740496401796),
 ('lot', 0.3919719432521571),
 ('pocket', 0.409106234142296),
 ('size', 0.4996583982212509),
 ('liner', 0.621093919540953),
 ('snap', 0.7515983205641947),
 ('organic', 0.8492109718887274),
 ('bamboo', 0.851353947350201)]

The above list shows some high quality candidates for attribute values associated with cloth diapers.

Step 6: Key Phrase Extraction And Text Summarization

So far we have tried to extract important keywords descriptive of the "cloth diapers" category. Another highly effective way to get a good idea of the context and use of different keywords w.r.t. "cloth diapers" across all the product titles and descriptions is text summarization. We used spaCy pytextrank[11], which creates a graph similar to the keyword TextRank graph of step 3 but uses phrases instead of keywords as graph nodes.

Summarization tries to paint a complete picture of the different topics discussed in the text. In our case the list below consists of all attributes and brands present in the clothing diaper category.

The tokens marked in bold represent probable brand names.

Part 2 – Multi Stage Approach : Brand Extraction

The first phase generates a broad list of keywords using an ensemble of unsupervised methods. As we have seen earlier, the output of keyword extraction algorithms is often noisy and contains generic keywords like “price”, “shipping”, “FCFS” etc. that are not category specific. Once a candidate set is generated we apply a set of rules: blacklisted keywords, noun/proper-noun detection, named entity detection and other heuristics to prune the candidate list. Ideally a classification model should then further classify the pruned set of candidates as probable attribute values or not. We discuss these discriminator based methods in the next part of this blog series.

The rest of the document explains the individual components described in the above architecture.

The proposed architecture comprises two key components:

I. Candidate Generation


From the plethora of approaches discussed in [12] and [13], the best methods for unsupervised keyword extraction are graph based. In step 6 (Key Phrase Extraction And Text Summarization) we have already seen that although the results of TextRank contained noise and generic keywords, the recall of category specific keywords was high.

In the first phase we can simply run some off-the-shelf keyword extraction algorithms like the ones discussed below and take the union of the keyword sets they generate.

A. Graph Based Keyword Extraction Methods

The majority of popular unsupervised approaches use graph based centrality measures like eigenvector centrality and PageRank to find relevant keywords in the co-occurrence token graph of the concerned text. Complex Network based Supervised Keyword Extractor[23] shows that although various centrality based metrics effectively capture node importance, the probability density distribution of strength for keywords and non-keywords in their training set shows overlapping areas near high strength values. The overlap indicates that strength alone is not an accurate discriminator between keywords and non-keywords. The experimental results of [23] validate our finding that unsupervised methods like TextRank, which relies on a PageRank based graph centrality metric, extract a large number of false positive keywords.

Graph Centrality Metrics

  1. Eigenvector Centrality : quantifies a node’s embeddedness in the network while recursively taking into account the prestige of its neighbors.
  2. PageRank : computes the prestige of a node under a random-walk model.
  3. PositionRank : an extension of PageRank based on the intuition that keywords are likely to occur towards the beginning of the text rather than towards the end.
  4. Coreness : a network degeneracy property that decomposes the network G into a set of maximal connected subgraphs, such that nodes in each subgraph have degree at least k within the subgraph. The coreness of a node is the highest core to which it belongs.
  5. Clustering Coefficient : the clustering coefficient of a node indicates the edge density in its neighborhood. It is a local property.
  6. Strength : the strength (weighted degree) of a node measures its embeddedness at the local level.
Density of Distribution of graph node properties for keywords and non-keywords [23]
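Two of these metrics are easy to make concrete on a small weighted co-occurrence graph. A minimal sketch (the graph and its node names are made up for illustration):

```python
def strength(adj, node):
    """Strength = weighted degree: sum of edge weights incident to the node."""
    return sum(adj[node].values())

def clustering_coefficient(adj, node):
    """Fraction of a node's neighbor pairs that are themselves connected."""
    neighbors = list(adj[node])
    k = len(neighbors)
    if k < 2:
        return 0.0
    links = sum(
        1
        for i in range(k)
        for j in range(i + 1, k)
        if neighbors[j] in adj[neighbors[i]]
    )
    return 2 * links / (k * (k - 1))

# Toy co-occurrence graph: {node: {neighbor: edge_weight}}
graph = {
    "cloth":  {"diaper": 4, "cover": 2, "insert": 1},
    "diaper": {"cloth": 4, "cover": 2},
    "cover":  {"cloth": 2, "diaper": 2},
    "insert": {"cloth": 1},
}
```

Here "cloth" has high strength but only a moderately connected neighborhood, which is exactly the kind of disagreement between metrics that [23] exploits with a supervised combination.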

Graph Based Keyword And Key Phrase Extraction Algorithms

  1. TextRank[4]
    • uses only nouns and adjectives as nodes in the graph
    • no edge weights in the keyword graph
  2. SingleRank[14]
    • incorporates edge weights, since co-occurrence statistics carry crucial information about context
  3. RAKE[15] (Rapid Automatic Keyword Extraction)
    • utilizes both word frequency and word degree to assign scores to phrases
  4. SGRank[16] and PositionRank[17]
    • stage 1
      1. utilize statistical, positional, and word co-occurrence information
      2. consider only nouns, adjectives and verbs
      3. take term frequency conditions into account
    • stage 2
      1. the candidate n-grams are ranked based on a modified version of TfIdf
    • stage 3
      1. the top-ranking candidates are re-ranked based on additional statistical heuristics, such as position of first occurrence and term length
    • stage 4
      1. the ranking produced in stage three is incorporated into a graph-based algorithm which produces the final ranking of keyphrase candidates
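The recipe shared by these methods (build a co-occurrence graph over candidate words, run PageRank on it, keep the top-scoring nodes) can be sketched in a few lines. This is a simplified illustration, not any one paper's exact algorithm:

```python
from collections import defaultdict

def textrank_keywords(tokens, window=2, damping=0.85, iterations=30, topk=3):
    """Rank words by PageRank over a co-occurrence graph.

    `tokens` should already be filtered (e.g. nouns/adjectives only).
    """
    # Build an undirected, weighted co-occurrence graph.
    adj = defaultdict(lambda: defaultdict(float))
    for i, a in enumerate(tokens):
        for j in range(i + 1, min(i + window + 1, len(tokens))):
            b = tokens[j]
            if a != b:
                adj[a][b] += 1.0
                adj[b][a] += 1.0
    # Weighted PageRank by power iteration.
    scores = {w: 1.0 for w in adj}
    for _ in range(iterations):
        scores = {
            w: (1 - damping)
            + damping * sum(
                scores[nb] * adj[nb][w] / sum(adj[nb].values())
                for nb in adj[w]
            )
            for w in adj
        }
    return sorted(scores, key=scores.get, reverse=True)[:topk]

tokens = "cloth diaper cover cloth diaper insert reusable cloth diaper".split()
```

Frequently co-occurring tokens dominate the ranking, which is also why generic high-frequency words leak through as false positives.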

B. Topic Based Methods

  1. TopicRank[18]
    • preprocesses the text to extract candidate phrases; the candidate phrases are then grouped into separate topics using hierarchical agglomerative clustering
    • in the next stage, a graph of topics is constructed whose edges are weighted based on a measure that considers the phrases’ offset positions in the text
    • TextRank is used to rank the topics, and one keyphrase candidate is selected from each of the N most important topics
  2. Salience Rank[19]
    • runs PageRank only once, incorporating into it a word metric called word salience Sα(w), which is a linear combination of the topic specificity and corpus specificity of a word (the latter can be calculated by counting word frequencies in a specific corpus). Intuitively, topic specificity measures how much a word is shared across topics (the less the word is shared across topics, the higher its topic specificity)

C. Graph-based Methods with Semantics

“The main problems of the topic-based methods are that the topics are too general and vague. In addition, the co-occurrence-based methods suffer from information loss, i.e., if two words never co-occur within a window size in a document, there will be no edges to connect them in the corresponding graph-of-words even though they are semantically related, whereas the statistics-based methods suffer from information overload, i.e., the real meanings of words in the document may be overwhelmed by the large amount of external texts used for the computation of statistical information.”[13]

Method 1 – Distant supervision using knowledge graphs[20]:
a. Nouns and named entities (keyterms) are selected and grouped based on semantic similarity by applying clustering
b. The keyterms of each cluster are connected to entities of DBpedia
c. For each cluster, the relations between the keyterms are detected by extracting the h-hop keyterm graph from the knowledge graph, i.e., the subgraph of DBpedia that includes all paths of length no longer than h between two different nodes of the cluster.
d. Then, all the extracted keyterm graphs of the clusters are integrated into one and a Personalized PageRank (PPR)[6] is applied on it to get the ranking score of each keyterm.

Method 2 – WikiRank[21] is an unsupervised automatic keyphrase extraction method that tries to link semantic meaning to text
a. Use TAGME[8], which is a tool for topic/concept annotation that detects meaningful text phrases and matches them to a relevant Wikipedia page
b. Extract noun groups whose pattern is zero or more adjectives followed by one or more nouns as candidate keyphrases
c. A semantic graph is built whose vertices are the union of the concept set and the candidate keyphrase set. In case the candidate keyphrase contains a concept according to the annotation of TAGME an edge is added between the corresponding nodes
d. The weight of a concept is equal to the frequency of the concept in the full-text document.
e. The score of a concept c in a subgraph of G is then computed as defined in [21].
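Step b's candidate pattern (zero or more adjectives followed by one or more nouns) is easy to sketch over POS-tagged tokens. The coarse tags below are illustrative, not the output of any specific tagger:

```python
def noun_group_candidates(tagged):
    """Extract phrases matching (ADJ)* (NOUN)+ from (token, pos) pairs."""
    candidates = []
    i = 0
    while i < len(tagged):
        j = i
        while j < len(tagged) and tagged[j][1] == "ADJ":
            j += 1
        k = j
        while k < len(tagged) and tagged[k][1] == "NOUN":
            k += 1
        if k > j:  # at least one noun: keep adjectives + nouns as one phrase
            candidates.append(" ".join(tok for tok, _ in tagged[i:k]))
            i = k
        else:      # adjectives without a following noun: skip ahead
            i += 1
    return candidates

tagged = [("reusable", "ADJ"), ("cloth", "NOUN"), ("diaper", "NOUN"),
          ("with", "ADP"), ("snap", "NOUN")]
```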

II. Candidate Pruning

Generic Keywords Filtering

The rule set would comprise a blacklist of keywords that are generic and don’t correspond to any specific characteristic of a product category. We can keep a global as well as a category-specific keyword list. An example keyword blacklist:

keyword_blacklist = ["hola", "pick", "pickup", "cash", "collect", "locate", "work", "contact", "use", "locat", "news", "good", "total", "valu", "complete", "list", "sold", "know", "limit", "want", "near", "offer", "new", "date", "nice", "day", "interest", "ship", "sell", "item", "price", "need", "link", "like", "sale", "includ", "free", "look", "condit", "www", "great", "machine", "come", "ask", "com", "avail", "beauti", "excel", "best", "thank", "meet", "came", "help", "got"]

Apart from creating a blacklist we can also use statistical analysis to study the spread/dispersion of a keyword across categories. A keyword with a low inverse document frequency score, i.e. high coverage across categories, should be automatically filtered.
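The dispersion check can be as simple as counting what fraction of categories a keyword shows up in and dropping anything too widespread. A sketch with made-up data:

```python
def category_dispersion(keyword, keywords_by_category):
    """Fraction of categories whose extracted keyword set contains the keyword."""
    hits = sum(1 for kws in keywords_by_category.values() if keyword in kws)
    return hits / len(keywords_by_category)

def filter_generic(candidates, keywords_by_category, max_dispersion=0.5):
    """Keep only candidates that are concentrated in few categories."""
    return [
        kw
        for kw in candidates
        if category_dispersion(kw, keywords_by_category) <= max_dispersion
    ]

# Illustrative per-category keyword sets.
keywords_by_category = {
    "cloth diapers": {"prefolds", "inserts", "price", "shipping"},
    "coffee makers": {"keurig", "carafe", "price", "shipping"},
    "office desks":  {"wooden", "drawer", "price"},
}
```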

Another set of rules would be extraction rules that help us extract brands. This rule set would involve simple pattern matching components that help us create a weak supervision system along the lines of how SNORKEL[22] works.

For brand extraction we perform the following steps, applying this set of weak supervision rules:

Heuristics Based Filtering

  1. Filter output candidates based on POS tagger and NER tagger output
    • is the candidate a noun or proper noun
    • is the NER label ORG
    • is the first character uppercase
  2. Statistical pruning
    • is the length less than 20 and greater than 3
  3. Filter candidates based on the block word list
  4. Filter candidates based on a dictionary check
  5. Filter candidates based on syntactic checks
    • is the candidate ASCII
    • is not all numeric
    • is not in the English dictionary
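Chained together, the statistical and syntactic checks above look roughly like this. The thresholds mirror the list; the dictionary check is stubbed with a plain set for illustration:

```python
ENGLISH_WORDS = {"great", "machine", "new"}  # stand-in for a real dictionary

def passes_filters(candidate, blacklist):
    """Return True if the candidate survives all pruning heuristics."""
    token = candidate.strip()
    if token.lower() in blacklist:      # block word list
        return False
    if not (3 < len(token) < 20):       # statistical pruning
        return False
    if not token.isascii():             # syntactic check: ASCII only
        return False
    if token.isnumeric():               # syntactic check: not all numeric
        return False
    if token.lower() in ENGLISH_WORDS:  # dictionary check
        return False
    return True

blacklist = {"price", "shipping", "fcfs"}
```

Brand-like tokens ("bumgenius") survive; generic, short, numeric or dictionary tokens are pruned.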

One of the foremost drawbacks of using a totally unsupervised method for keyword extraction is that we don’t learn from our mistakes once we get feedback on extracted keywords from a taxonomist. Another issue is that without a model based approach for keyword extraction, we miss out on approaches like Active Learning that can guide us in creating better training data and hence a better feedback loop to get the contentious data points labeled. Unsupervised keyword extraction techniques are also not domain sensitive, i.e. they are not tunable to the data domain or topic and spit out lots of generic keywords that are not important for the domain under consideration. Finally, the sheer number of candidates generated is hard to get labeled, so we definitely need a mechanism to rank them in order of importance and get feedback from domain experts.

Next post discusses semi supervised approaches to improve the attribute extraction process using Graph Neural Networks.


  1. Ward’s Hierarchical Clustering Method: Clustering Criterion and Agglomerative Algorithm
  2. Semi-Supervised Class Discovery
  3. PinnerSage: Multi-Modal User Embedding Framework for Recommendations at Pinterest
  4. TextRank: Bringing Order into Texts
  5. Simple Unsupervised Keyphrase Extraction using Sentence Embeddings
  6. NetworkX: Network Analysis in Python
  7. Enriching Word Vectors with Subword Information
  8. Poincaré Embeddings for Learning Hierarchical Representations
  9. https://rare-technologies.com/implementing-poincare-embeddings
  10. https://radimrehurek.com/gensim/models/poincare.html
  11. https://spacy.io/universe/project/spacy-pytextrank
  12. A Review of Keyphrase Extraction
  13. Kazi Saidul Hasan and Vincent Ng, Automatic Keyphrase Extraction: A Survey of the State of the Art, Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics
  14. Wan, X. and Xiao, J. (2008) Single document keyphrase extraction using neighborhood knowledge. In Proceedings of the 23rd AAAI Conference on Artificial Intelligence, AAAI 2008, Chicago, Illinois, USA, July 13-17, 2008, 855–860. URL: http://www.aaai.org/Library/AAAI/2008/aaai08-136.php.
  15. Rose, S., Engel, D., Cramer, N. and Cowley, W. (2010) Automatic keyword extraction from individual documents. Text Mining: Applications and Theory, 1–20. URL: http://dx.doi.org/10.1002/9780470689646.ch1.
  16. Danesh, S., Sumner, T. and Martin, J. H. (2015) Sgrank: Combining statistical and graphical methods to improve the state of the art in unsupervised keyphrase extraction. In Proceedings of the Fourth Joint Conference on Lexical and Computational Semantics, Denver, Colorado, USA., June 4-5, 2015, 117–126. URL: http://aclweb.org/anthology/S/S15/S15-1013.pdf.
  17. Florescu, C. and Caragea, C. (2017b) PositionRank: An unsupervised approach to keyphrase extraction from scholarly documents. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, ACL 2017, Vancouver, Canada, July 30 – August 4, 2017, Volume 1: Long Papers, 1105–1115. URL: https://doi.org/10.18653/v1/P17-1102.
  18. Bougouin, A., Boudin, F. and Daille, B. (2013) TopicRank: Graph-based topic ranking for keyphrase extraction. In Proceedings of the 6th International Joint Conference on Natural Language Processing, IJCNLP 2013, Nagoya, Japan, October 14-18, 2013, 543–551. URL: http://aclweb.org/anthology/I/I13/I13-1062.pdf.
  19. Teneva, N. and Cheng, W. (2017) Salience rank: Efficient keyphrase extraction with topic modeling. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, ACL 2017, Vancouver, Canada, July 30 – August 4, Volume 2: Short Papers, 530–535. URL: https://doi.org/10.18653/v1/P17-2084.
  20. Shi, W., Zheng, W., Yu, J. X., Cheng, H. and Zou, L. (2017) Keyphrase extraction using knowledge graphs. Data Science and Engineering, 2, 275–288. URL: https://doi.org/10.1007/s41019-017-0055-z.
  21. Yu, Y. and Ng, V. (2018) Wikirank: Improving keyphrase extraction based on background knowledge. In Proceedings of the 11th edition of the Language Resources and Evaluation Conference, LREC 2018, 7-12 May 2018, Miyazaki (Japan), 3723–3727. URL: http://www.lrec-conf.org/proceedings/lrec2018/pdf/871.pdf.
  22. Snorkel: Rapid Training Data Creation with Weak Supervision
  23. Complex Network based Supervised Keyword Extractor

Posted in Uncategorized | 1 Comment

Talk On Multi Stage Ranking



QUS : Query Understanding Service


The journey of a search query through an e-commerce engineering stack can be broadly divided into the following phases: the search query text processing phase; the retrieval phase, where relevant products are fetched from the indexer; and, last but not least, the product re-ranking phase, where a machine learning ranking engine re-sorts the products based primarily on a combination of KPIs like click through rate, add to cart rate, checkout rate etc. The focus of this post is primarily on the first phase, i.e. query text processing via a Query Understanding Service (QUS). I would be discussing the applications and working of QUS in e-commerce search. QUS is one of the most critical services needed to resolve a user query and find the key search intent. Among the plethora of machine learning (ML) services working across the engineering stack in any e-commerce company, QUS is usually the first to hold the fort and acts as the backbone ML service in the pre-retrieval phase.

When a user enters a query in the search box, the first step is to ingest that raw text and generate some structured data from it. The objective here is to find as much relevant information from the query as possible and retrieve the most relevant results for the user.

The search query in e-commerce can contain many clues that can guide us in finding results pertinent to the user’s intent. A query like “levi black jeans for men” consists of an un-normalized brand name “levi”, gender “men”, product type “jeans” and color “black”, and the top-level taxonomy category of the query can be Clothing & Accessories. The aim of QUS is to find these attributes, normalize them (levi => Levi Strauss & Co, mens => Men etc.) and send this information to the retrieval endpoint to fetch relevant results. QUS comprises an ensemble of sub-services like a query category classification service, query tagging service, attribute normalization service etc. In the case of long-tail queries (queries that are not too common and result in either very limited products or null results) the output of QUS can be further used to relax the query by rewriting it; this process is known as query relaxation. Furthermore we can also expand the query (query expansion), where we show the user results that are near-similar to the search query, e.g. if a user searches for “light blue tee”, we can expand the result set so that the color can be either light blue, blue or violet. Also, in the case of brand-sensitive queries, the result set can be expanded to near-similar brands to give the user exposure to available alternatives in your inventory.

Common Issues

QUS helps move search beyond raw keyword match. In a world without QUS, the following issues can occur if we depend on pure Solr retrieval:

  1. Wrong product categorization : queries like “hershey cocoa powder” that belong to the Grocery category can retrieve fashion products, since the query contains the word “powder”.
  2. Retrieval Sensitivity :
    1. Queries like “office desk wooden” vs “office desk wood”, or “halloween costumes for girls” vs “halloween costumes girls”, can produce different results although they have the same search intent.
    2. Solr gives equal weightage to all tokens. This can result in a situation where the token “range” gets the same weight as “eggs”; hence the result set includes “range free chicken”.

3. No query relaxation: too-narrow queries mostly go unresolved, e.g. “fleece queen size sheets” can return blankets; the query should be relaxed to “queen size sheets”.

4. Price Agnostic Search: another feature of QUS is to extract price information from the query; this can be done either with a set of regular expressions or with a tagger + normalizer.

Raw Query: black pants under $50

5. Unresolved Brand Variants: for queries like “coke six pack” vs “coca-cola six pack”, the results can be different.

Search Phases

We can divide the e-commerce search process into two phases.

Post Retrieval :

This phase is concerned with the relevant results retrieved for the query. It is here that we rank the results, add recommendations and add sponsored products to the final list of results.

Pre Retrieval :

This is the phase where we haven’t yet retrieved results from the backend indexing engine. The only information we have to work with is the raw user query. Usually this phase comprises the following components (which can be separate microservices):

Spell corrector

Intent Detection

Query Classifiers

Category Classifier: This service classifies the query into leaf-level categories corresponding to the catalog structure/taxonomy. The output of this service is a set of leaf-level categories. Solr can either filter the results on the basis of the predicted set of categories or boost the results for products belonging to the predicted category set.

Product Type (PT) Classifier: Usually taxonomies are hierarchical in nature, e.g. under the top-level Clothing & Accessories category you would have Shirts at level 2, Men/Women/Children at level 3, formal/casual at level 4 etc. Due to noise in catalog content and semantic issues in catalog structure (near-similar leaf categories under different L1s, e.g. tea can be in Grocery as well as Office > Pantry Items) it is usually better to classify the query into flat-hierarchy Product Types, e.g. in the context of the last query the PT would be just Shirts; if the query is “iphone 9 charger”, the PT would be “Phone Chargers”.

Query Tagger

Just like the query category classifier and the query product type classifier, the query tagger is another component of QUS, but unlike them it tags individual tokens in a query rather than categorizing the whole string.

The aim here is to successfully detect the customer’s intent and improve retrieval by finding the tokens in the query that correspond to the key product being searched, brand, gender, color, size, price range etc. This would help us in

  1. refining the retrieval in a faceted manner
  2. resolving long tail queries
  3. query rewriting and query relaxation
  4. product promotion / recommendation 
Architecture: Bidirectional LSTM-CRF Models for Sequence Tagging
Tutorial : https://guillaumegenthial.github.io/sequence-tagging-with-tensorflow.html
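Whatever the model, the tagger emits one BIO tag per token, and the spans then have to be folded back into attribute mentions. A small decoding helper (the tag names B-BR/B-PT etc. are illustrative):

```python
def decode_bio(tokens, tags):
    """Group B-X/I-X token tags into (label, phrase) spans."""
    spans, label, buf = [], None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):          # a new span starts
            if label:
                spans.append((label, " ".join(buf)))
            label, buf = tag[2:], [token]
        elif tag.startswith("I-") and label == tag[2:]:
            buf.append(token)             # continue the current span
        else:                             # "O" or inconsistent tag closes it
            if label:
                spans.append((label, " ".join(buf)))
            label, buf = None, []
    if label:
        spans.append((label, " ".join(buf)))
    return spans

query = "levi black jeans for men".split()
tags = ["B-BR", "B-CL", "B-PT", "O", "B-GEN"]
```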

Tagger Attributes

Broadly speaking there are two types of attributes, namely global and local. Local attributes are highly specific to particular leaf categories, e.g. Home > Furniture > Table would have attribute key/value pairs like:

  • Top Material : Engineered Wood
  • Material : Particle Board
  • Furniture Finish : Espresso

Local Attributes

Since each leaf-level category can have specific attributes that are not applicable to other categories, we can end up with a very large number of local attribute key/value pairs. That’s why it is better not to use the tagger to detect these attributes: we would face performance issues scaling the tagger to that many attributes.

On the contrary, there is another set of attributes that are global in nature, i.e. not tied to any particular category; e.g. size can be found in clothing, furniture, appliances etc. The values of these attributes can still be category specific, e.g. size in clothing can take values like XS, S, M and L, while the Home Appliances > Microwave category could have valid size values like Compact, Mid-Sized and Family Size. Since these attributes are present across categories, or at least in a bunch of categories, it is better to use a tagger to detect them. The table below comprises what we call global attributes.

  • Brand : companies and their subsidiaries that offer products (e.g. sony, philips, great value, coke)
  • Gender : men, boys, women, unisex, girl etc.
  • Color : specified colors; also includes materials that represent colors such as gold, silver and bronze
  • Character : recognizable entities that exist across multiple Brands and Product Lines (e.g. Batman, Golden State Warriors, Taylor Swift, UCLA Bruins)
  • PT Descriptor : features pertaining to the product as well as media titles (e.g. with shelves, led lightbulbs, round dining table)
  • Product Line : product lines from brands (e.g. playstation, sonicare, air jordans)
  • Miscellaneous : all other relevant tokens, including themes, versions, years and model numbers (e.g. star wars, 2013, paris, ucla, in-store, highest rated)
  • Price : $, dollars, bucks, sale, clearance
  • Age :
    1. Age Value : numeric value for age (e.g. 8, 12)
    2. Age Unit : context for defining the value (e.g. month, year)
    3. Age Type : qualitative measurement for age (e.g. baby, teenage, elder, young, jr)
  • Size :
    1. Size Value : numeric value and word representation for sizing (e.g. 3, 120, double)
    2. Size Unit : context for defining size (e.g. oz, lb, gb)
    3. Size Category : grouping for size units (e.g. weight, volume, length, diameter)
    4. Size Type : qualitative measurement for size (e.g. small, medium, large, 4xl, mini, giant, tall, short, wide, slim)
  • Quantity :
    1. Quantity Value : numeric value for quantities
    2. Quantity Unit : context for defining quantity (e.g. piece, sheets)
    3. Quantity Type : qualitative measurement for quantity (e.g. value size, bulk, set)

Global Attributes

Attribute Normalization

Once the tagger detects a mention in the query and tags the individual tokens, the next step is to normalize those tokens. For attribute types like color, normalization is pretty straightforward, e.g. {red, light red, pink, …} can be mapped to one color family with the normalized name RED; similarly for price we can create a standardized denomination using a set of regular expressions. With normalization we are aiming to standardize the attribute key/value pair w.r.t. the values in the catalog. The prior requirement here is that products in the catalog have canonicalized attribute values, e.g. all men’s shirts have a size attribute mapped to a predefined set of values {XS, S, M, L, XL, XXL, …}. Once we detect a size attribute in a query like “mens small checked shirt”, the next step is to normalize the size token “small” to the normalized catalog attribute value “S”. This helps us either make a faceted query to Solr or boost products in retrieval where the size attribute is “S”, thereby enhancing retrieval quality.
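At its simplest, this catalog-aligned normalization is a lookup table. A tiny sketch (the mappings are illustrative):

```python
# Illustrative unnormalized -> canonical maps, keyed by attribute type.
NORMALIZATION_TABLES = {
    "size":  {"small": "S", "sm": "S", "medium": "M", "large": "L", "xl": "XL"},
    "color": {"light red": "RED", "pink": "RED", "crimson": "RED"},
}

def normalize(attribute, value):
    """Map a tagged attribute value to its canonical catalog form."""
    table = NORMALIZATION_TABLES.get(attribute, {})
    # Fall back to the raw value when no canonical form is known.
    return table.get(value.lower(), value)
```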

Numerical attributes like price, quantity (e.g. 1 gallon milk) and age (toys for 8-12 years old) can be handled with a regular expression driven approach. Once we detect the category and product type of a query, we can apply the set of regular expressions applicable only to those categories and PTs to extract numerical attributes, e.g. for a query like “2 gallon whole milk” the category can be “Grocery>Milk>Whole Milk” and the PT can be Milk; once we know these values we can apply a set of regular expressions created exclusively to handle grocery/milk quantity normalization. The following queries all have a price attribute value of 20 that can be easily extracted using a couple of regular expressions.

a. “tees under 20”

b. “tees under $20”

c. “tees under 20 dollars”

d. “tees under 20 usd”
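All four variants fall to a single pattern. A sketch (the pattern is illustrative, not exhaustive):

```python
import re

# Matches "under 20", "under $20", "under 20 dollars", "under 20 usd", ...
PRICE_PATTERN = re.compile(
    r"(?:under|below|less than)\s*\$?\s*(\d+(?:\.\d+)?)\s*(?:dollars|usd|bucks)?",
    re.IGNORECASE,
)

def extract_max_price(query):
    """Return the upper price bound mentioned in the query, if any."""
    match = PRICE_PATTERN.search(query)
    return float(match.group(1)) if match else None
```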

Overall, attribute normalization can be achieved using the following approaches:

  1. Regular expression based methods
  2. Rule based methods
    1. This is a simple yet very effective approach. Before jumping to NLP-based methods, a good way to get some quick wins and draw a performance baseline is to manually create rules for normalization.
    2. A SQL-like table can be created with each unnormalized attribute mapped to its normalized variant.
    3. A simple lookup in the table can then generate normalized attribute values.
  3. Classification based approach
    1. The key drawback of the rule based approach is that it is not scalable. It would take too much manual effort to analyze the queries, find varied patterns in them and create explicit rules to map them to normalized values.
    2. But the previous approach does provide us a labeled data set to work with, i.e. attribute values mapped to their canonicalized versions.
    3. This data set can be used to create an attribute classifier. The difference from the category classifier mentioned earlier is that here we would be classifying the tagged mention (e.g. Product Line), whereas earlier we were classifying the whole query string.
  4. Entity linking based approach
    1. Entity linking is a big topic that I plan to write about in a separate post. But to give the gist of the idea: entity linking is a process that detects mention strings in a text that may correspond to named entities (just like tagging finds mentions and tags them with attribute keys like Brand, Size etc.) and then tries to map them to entities in a knowledge base. This method can be useful while trying to detect brands in queries as well as in product titles and descriptions.
    2. Although there are neural architectures that can detect the mention string and then link the mention to the best candidate entity, in the next section we discuss a similar model based approach.

Entity Linking Based Brand Normalization

Let’s say we have a mention string tagged as Brand in the search query. The entity linking task can be broken into two steps: candidate generation and candidate ranking. Candidate generation means fetching normalized brand candidates that are syntactically or semantically similar to the mention string, e.g. for the search query “paw-patrol fire truck”, the tagger would tag the mention “paw-patrol” as Brand, and the candidate generation phase can find a set of syntactically similar brands from the catalog for the category Toys. Traditionally an information retrieval based approach has been used for candidate generation, e.g. BM25 (a variant of TF-IDF) to measure the similarity between the mention string and candidate brands and their descriptions. Once we have a set of candidates we can rank them in the candidate ranking phase.

A context-aware ranking can be done by using the left span of the mention, the mention string and the right span of the mention as separate inputs to a model. We can create a model to score the contextual similarity of a mention string to the description of a brand. For getting descriptions of brands (and hence their representations) we can either create a custom dataset or get brand pages from Wikipedia and learn a representation for each brand using the title and introduction of its page.

In Learning Dense Representations for Entity Retrieval the authors use a model architecture similar to sentence similarity architectures to put the entity description and mention description representations near each other in the same space. This kind of approach is highly preferable since the brand representations can be pre-computed; and since in this architecture there is no direct interaction between the encoders on each side, we can just compute the contextually aware representation of the mention string and take a dot product between it and the pre-computed brand representations. This enables efficient retrieval, but constrains the set of allowable network structures. The image from the mentioned publication depicts how the components from the query string and entity description can be used to find the similarity between the two.
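The payoff of that no-interaction design is that ranking reduces to dot products against precomputed vectors. A toy sketch with made-up 3-d embeddings (a real system would use encoder outputs with hundreds of dimensions):

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def rank_brands(mention_vector, brand_vectors):
    """Rank precomputed brand embeddings by dot product with the mention encoding."""
    return sorted(
        brand_vectors,
        key=lambda brand: dot(mention_vector, brand_vectors[brand]),
        reverse=True,
    )

# Precomputed offline from brand titles/descriptions (toy values).
brand_vectors = {
    "Coca-Cola":    [0.9, 0.1, 0.0],
    "Paw Patrol":   [0.0, 0.8, 0.5],
    "Black+Decker": [0.1, 0.0, 0.9],
}

# Contextual encoding of the mention "paw-patrol" in "paw-patrol fire truck".
mention_vector = [0.1, 0.7, 0.6]
```

In production the sorted scan would be replaced by an approximate nearest neighbor index, but the scoring function stays the same.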

Brand Resolution In A Noisy Catalog

It may happen that brand names are not normalized in the catalog. In this case a brand, e.g. Coca-Cola, can be referred to by different products in the catalog using different variants, e.g. coke, coca cola, coca-cola, coca + cola etc. Here we can’t normalize the brand in the query since the brand names in the catalog aren’t normalized. So instead of canonicalizing the brand in the query we should aim to fetch products that refer to any variant of the searched brand.

Brand Normalization Process Flow

A simple yet handy way to canonicalize brand names in queries involves the following steps:

  1. Parse the catalog to create a mapping of product type : list of available brand
    1. E.g. coffee maker : [black decker, black & decker, black and decker, black + decker, Mr. Coffee, Keurig,  …..]
  2. Use Product Type classifier to get the query PT
  3. Use query tagger to get the token/tokens tagged as B-BR and I-BR
  4. Now match the tagged tokens with list of brands corresponding to the predicted product type and use string similarity approaches to select candidate brands
    1. Trigram-Jaccard
    2. Levenshtein Edit Distance
    3. JaroWinkler
    4. FuzzyWuzzy


For a query like “coffee maker black n decker” the predicted Product Type can be “Coffee Maker” and the mention string tagged as brand can be “black n decker”. A lookup in the PT-to-brand-list map returns the list of valid brand variants in the catalog for the PT “Coffee Maker”: [black decker, black & decker, black and decker, black + decker, Mr. Coffee, Keurig, …]. Now by using an edit distance of either 1 or 2 we can find the candidate brands from the brand list mapped to product type coffee maker: [black decker, black & decker, black and decker, black + decker]. Later on, Solr can boost all these brands while retrieving the results for this query. In this approach we don’t even need brands to be normalized in the catalog, since we can boost all variants in one go.
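The fuzzy match in step 4 can be done with plain edit distance. A self-contained sketch reproducing the coffee-maker example:

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution
            ))
        previous = current
    return previous[-1]

def candidate_brands(mention, brands, max_distance=2):
    """Brand variants within `max_distance` edits of the tagged mention."""
    return [b for b in brands if levenshtein(mention, b.lower()) <= max_distance]

brands = ["black decker", "black & decker", "black and decker",
          "black + decker", "Mr. Coffee", "Keurig"]
```

Trigram-Jaccard or Jaro-Winkler would slot into `candidate_brands` the same way; only the similarity function and threshold change.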

Connecting the Dots

Once QUS fetches all the key insights from the query, the results are forwarded to Solr. The prerequisite is that the catalog is indexed with separate dimensions for Category, Brand, Product Type etc. For retrieval we can use a weight based scheme where a higher boost is given to the predicted categories, product types, brands etc. For attributes like category and PT we can even make a faceted search call while adding boosts for the predicted brand, product line, size etc. Furthermore we can add a relaxed secondary query to the main query so that recall stays high. This helps in resolving long-tail queries and null-result queries. The product ranker layer can take care of ordering the relaxed query’s supplemental results w.r.t. the products returned from the main query. More advanced techniques, like creating a separate optimization service that predicts the weights of attributes to be boosted based on the user query, can further enhance the relevance of the returned results; e.g. for a clothing query the Solr weight prediction algorithm can give more weight to brand and price rather than style/pattern.


Demand Forecasting – Online Advertisements

Objective: Given an ad campaign’s targeting constraints, ad slots, start date and end date, we want to forecast the available inventory for that line item. If the ad slots in question are being targeted by other active campaigns during the concerned time period, then the forecasting system should project demand taking the latter into account.


To effectively forecast the demand we can use a two phase approach defined below

1. Forecast overall inventory from a particular ad slot
2. Project the proportion of the forecasted overall inventory available for the new campaign given some active campaigns targeting the ad slot.

  • Trend Detection : In this step we will try to analyze the global trend e.g. are the number of requests linearly increasing, if so at what rate.
  • Seasonal Behavior: This includes weekly and monthly fluctuations in inventory, particularly if we have an e-commerce publisher. This step will involve capturing the monthly seasonal fluctuations and including the details in the overall projected inventory.
  • Noise Removal: Here we will try to smooth out the series via removing the inherent noise in the data.
  • Impulse Handling: This covers unprecedented, and sometimes unaccounted-for, sudden rises in inventory which, if not handled properly, will lead to over-projection of inventory. Sometimes publishers introduce new features, a production-related issue causes unserved requests, or an unpredictable event (in case of a news site) produces huge peaks (impulses) in the page views time series. Our aim is to detect that such impulses are not a long-term trend and ignore them while projecting available inventory.

To accomplish the above mentioned goals, the research phase will involve, but not be restricted to, testing the performance of time series algorithms like

  • Holt-Winters
  • Kalman Filters (State Space Models)
  • Elastic Smooth Season Filtering
  • Discrete Fourier Transforms based forecasting techniques
  • Regression models for global trend detection, etc.
  • Some heuristic based techniques (e.g. moving average) to handle impulse and noise
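To make the first option concrete, here is a self-contained additive Holt-Winters sketch (a toy implementation, not production code; the smoothing constants are illustrative, not tuned):

```python
def holt_winters_additive(x, m, alpha, beta, gamma, h):
    """Additive Holt-Winters: level + trend + seasonal (period m), forecast h steps ahead."""
    level = sum(x[:m]) / m                               # mean of the first season
    trend = (sum(x[m:2 * m]) - sum(x[:m])) / (m * m)     # avg per-step change across seasons
    seas = [x[i] - level for i in range(m)]              # rough initial seasonal offsets
    for t, xt in enumerate(x):
        s = seas[t % m]
        last_level = level
        level = alpha * (xt - s) + (1 - alpha) * (level + trend)
        trend = beta * (level - last_level) + (1 - beta) * trend
        seas[t % m] = gamma * (xt - level) + (1 - gamma) * s
    return [level + (k + 1) * trend + seas[(len(x) + k) % m] for k in range(h)]
```

On a clean weekly-seasonal series with a linear trend this recovers the continuation closely; real impression data would additionally need noise and impulse handling.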

Experiments and Observations

I did some descriptive analysis on the impression delivery plots of Ad Id X from publisher XX, using data fetched from the Google DFP API. Interestingly, a weekly pattern was visible just by visual inspection. To find a weekly pattern (weekly seasonal behavior), a month is conventionally taken as one period of the time series, and to find a monthly pattern (monthly seasonal behavior) a year is considered one period unit. By analyzing yearly data we found the global trend, i.e. whether the overall impression rate to a site is increasing or not. But since we only have data from July 2013 to August 2014, we cannot infer anything statistically significant about the monthly seasonal behavior and trend, though we can clearly see the weekly seasonal behavior.

2014 monthly impressions plot – Jan(01) to Oct(10)


Initial Experimentation

We trained, tested and tuned the model using XX’s (publisher) DFP impression data, then compared the daily forecasted impression values with the actual delivered impressions. We will be updating the model parameters on a daily basis so as to effectively incorporate new insights from the data.

For the initial baseline model, used to create benchmark performance statistics, we zeroed in on a variation of the Holt-Winters forecasting algorithm with the additional feature of multi-season forecasting: along with daily trends within a week (e.g. learning that more impressions are received on weekends than on weekdays), it will also adjust to monthly seasonality. Since we didn’t have even 2 years of data, the algorithm will not be very accurate when dealing with monthly variations (though it will do far better than naive approaches, e.g. Google DFP’s approach of using just the last 28 days of data without considering seasonal trends), but I hope it will at least learn a good amount from the weekly data and distinguish weekday from weekend impression delivery.


Overall Features of the algorithm

  • Trend Detection : Finding if overall impression delivery is increasing over time or decreasing
  • Monthly Seasonal Decomposition : Find seasonal traffic behavior, e.g. if traffic is much higher during the year end, use this information for a better forecast during the following year’s end.
  • Weekly Seasonal Component : This will involve learning from the fluctuating daily traffic behavior and learning on which particular days of week more traffic is received.

Experiment 1.1 : Holt Winters Additive – 12*7 Seasons : In this prototype we tried to learn the global trend and seasonal patterns (a pattern for each day of the week, Monday to Sunday, for each month, so 12*7 = 84 distinct seasons). Since overall we didn’t have many data points to train 84 distinct seasons (only 4 data points per season), the results have significant variance, but we are still capturing overall seasonal trends, day-wise and monthly shifts, and the overall increasing global trend.

The plot below shows the predicted (one day ahead) and real impressions. The training data spans 07 July 2013 to 15 Oct 2014.


[Figure: one-day-ahead predicted vs. actual impressions]

Experiment 1.2 : Holt Winters Additive – 12 Month Seasons

In this approach we smooth out the weekly fluctuations by taking a centered moving average with period 7.

In this manner we removed the noise due to weekends, and we can now concentrate on learning the monthly behavior. The RMSE on test data reduced significantly with this approach; the only caveat is that errors will be large if we want to predict #impressions for one particular day. The algorithm will perform well if we want the overall #impressions for a time period.
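The smoothing step above can be sketched as a plain centered moving average (period 7, i.e. a one-week window):

```python
def centered_ma(x, period=7):
    """Centered moving average with an odd period; None where the window doesn't fit."""
    h = period // 2
    return [sum(x[i - h:i + h + 1]) / period if h <= i < len(x) - h else None
            for i in range(len(x))]

# a repeating weekly pattern smooths to its weekly mean (interior values all 20.0 here)
weekly = [10, 12, 14, 16, 18, 30, 40] * 4
smooth = centered_ma(weekly)
```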

[Figure: forecast after smoothing out weekly fluctuations]

test data = overall impressions served from October 17, 2014 to October 27, 2014 = 346246

smoothed value of test data = 372105



COVARIATE SHIFT: If the seasonal behavior of impressions corresponding to some ad unit changes completely, the algorithm has trouble adjusting to this abrupt change.

[Figures: additional experiment plots]

Results of modeling level of series


predicted value. I will try to work around this to find some sub-optimal value and use it to draw confidence intervals.

Generalized Additive Model (GAM)

Extracts from “Forecasting at Scale”, Sean J. Taylor et al.


The above mentioned paper is a really good study of how to model a time series as a GAM with change point detection to tackle covariate shift.

It further explores the following notions while modeling the trend:

  1. A saturating non linear growth function
  2. Linear trend with change points  
  3. Automatic Changepoint Selection 

For modeling the seasonal component, the paper uses a Fourier series to provide a flexible model of periodic effects.

s(t) = Σ_{n=1..N} ( a_n·cos(2πnt/P) + b_n·sin(2πnt/P) ), where P is the period of the seasonality.
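As an illustration of the idea (a pure-Python sketch, not Prophet’s implementation), harmonic coefficients for a periodic series sampled over whole periods can be recovered by projection:

```python
import math

def fourier_coeffs(x, period, order):
    """Project series x onto cos/sin harmonics; assumes len(x) is a multiple of period."""
    T = len(x)
    a, b = {}, {}
    for n in range(1, order + 1):
        a[n] = 2.0 / T * sum(x[t] * math.cos(2 * math.pi * n * t / period) for t in range(T))
        b[n] = 2.0 / T * sum(x[t] * math.sin(2 * math.pi * n * t / period) for t in range(T))
    return a, b

def seasonal_component(t, period, a, b):
    """Evaluate s(t) = sum_n a_n cos(2*pi*n*t/P) + b_n sin(2*pi*n*t/P)."""
    return sum(a[n] * math.cos(2 * math.pi * n * t / period) +
               b[n] * math.sin(2 * math.pi * n * t / period) for n in a)
```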

The third model is for Holidays and Events, to account for calendar effects (spikes and drops).

The above methodology has been implemented in Facebook’s Prophet library. In the next section I show its use on a toy data set.

Data Set: 

[Screenshot: toy data set]

Advantages of Facebook Prophet

  1. The formulation is flexible: we can easily accommodate seasonality with multiple periods and different assumptions about trends.
  2. Unlike with ARIMA models, the time series measurements need not have a regular period and we do not need to interpolate missing values to fit.
  3. Automatic Bayesian change point detection
  4. We don’t need to explicitly split the time series into level and seasonality components in a linear fashion
  5. Models level/trend via a non-linear growth function
  6. Periodic seasonality modelled via a Fourier series
  7. Holiday modeling: incorporates a list of holidays into the model in a straightforward way, assuming that the effects of holidays are independent.
  8. Can incorporate changes in seasonal behavior, e.g. if we used to observe peaks on Mondays and this behavior changes, Prophet can detect the change point and incorporate the effect in the seasonality component.

Fitting the model and forecasting 

[Screenshot: fitting the Prophet model and forecasting]

Extracting Components:

[Screenshot: extracted components]

The following attached pdf link (generated from an IPython notebook) showcases a comprehensive analysis of the toy time series used above. In the attached doc I explore various ways of modeling the components of a time series and highlight their corresponding pros and cons.

Timeseries Analysis IPython Notebook(pdf)


CTR Prediction System – Design Doc

The Anatomy Of Large Scale CTR* Prediction System

* With little or no modifications the proposed system design and algorithms can be used for optimizing other metrics like Cost Per Viewable Completion, Cost Per Completion, Cost Per Engagement, Cost etc

Aim: To build a scalable distributed machine learning framework based on a stacked classifier comprising Gradient Boosting Machines and Logistic Regression trained with Online Stochastic Gradient Descent. This system is specifically designed to achieve high performance for display advertising campaigns with a varied range of targeting criteria. The final aim of the prediction system is not restricted to achieving high performance; it extends to ease of deployment, frequent model updates in an online fashion, high scalability and distributed learning.

A Glimpse Into The Modeling Aspects:

Click and conversion prediction for display advertising presents a different set of challenges.

  1. Sparse Features: Categorical features like domain, publisher, advertiser ID, creative ID, cookie ID etc., when converted to binary representation with One Hot Encoding, lead to very sparse features.
  2. Biased Training Data: Normally the number of instances with a click is much smaller (less than 0.1%) than the number of negative training instances. This leads to a class imbalance problem which further complicates the training of a conventional classifier.
  3. Feature Generation: Usually we will generate different permutations of features, e.g. quadratic features and combinations of features, to help the classifier generate a non-linear decision boundary. Though this increases the performance of the system, it leads to an explosion in the dimensionality of the features and makes the model harder to train.
  4. Tuning Log Linear Model: L1 vs L2 regularization (conventional wisdom suggests going with L1 since it preserves sparsity)
  5. Tuning Stochastic Gradient Descent: Finding optimal rates for weight updates
  6. Subsampling: Finding the optimal sampling rate for the negative class to get a balanced data set
  7. Periodic updates to the model: The Logistic Regression model will be trained using online stochastic gradient descent. This makes training faster and scalable in a distributed manner, though we will need to periodically retrain the model every few hours to incorporate new data points.
  8. Gradient Boosting Machine: I will be training GBMs separately to generate features that will be consumed by the log-linear model. Since training GBMs takes considerably more time (I will have to test training time vs. #trees and training data size for GBMs on a Spark cluster), we will have to schedule a daily job to update the GBM.
  9. Dealing with new ads/creatives: We will have to generate a separate generalized model to handle new advertisers/creatives. Though the performance of this model will not be up to the benchmark, it will still help us exploit historical patterns and collect better training data for the new ad campaign.
  10. Generating Historical Features: We will be adding the historical performance of the advertiser and user as additional features to our feature vector.
  11. Mobile Advertisement: Whether to build a separate model for mobile/app-based ads, since the feature set will involve many new elements, e.g. device id, device type, app
  12. Feature Hashing: Reducing the dimensionality of the generated feature vector
  13. Explore and Exploit: How to sometimes show the wrong ads to the wrong users with the sole intention of exploring new users, finding new patterns and discovering covariate shift. If we only show ads according to our system, then the next day we will get a highly biased training sample.

Advanced Non-Linear Approaches

I will avoid using fancy ensemble techniques (the winners of the Avazu and Criteo CTR prediction challenges used 20-model ensembles and many other hacks) and algorithms like Field-aware Factorization Machines that have been shown to perform slightly better (improving the performance metric in the fourth decimal place) in Kaggle competitions, but which lead to grave overfitting and are not at all scalable in production. We need a mechanism that can be easily trained on a cluster and efficiently updated within a few hours. After going through more than a dozen research papers and other literature, I narrowed the final model down to an online-gradient-descent-based log-linear model with feature generation via gradient boosted machines. With appropriate computing power (a Spark cluster) we should be able to train the model every few hours, hence exploring new data. Many people have used Vowpal Wabbit to train log-linear models in Kaggle competitions; though it takes comparatively little time when using Stochastic Gradient Descent, they spent days generating basic features and on other data munging. Hence I plan to go with Spark, which claims to provide lightning-fast computation due to its in-memory computation framework.

One interesting approach mentioned in the ad click prediction literature to solve the problem of exploration is Bayesian Logistic Regression. The exploration problem can be explained with the following example: imagine that we trained our CTR prediction model using 2 days of training data. On the third day we used the trained model to only serve ads to requests that have a high predicted CTR. Now the third day’s training data will only contain instances that had a high predicted CTR, i.e. we will be getting a narrow/biased training sample. If new segment requests are encountered, or segment behavior changes, our model will not have the necessary training data to adapt to that change. So we need a procedure to gradually explore along with exploiting high-performing segments.

When it comes to the “Explore and Exploit” paradigm I have a strong proclivity towards Bayesian methods. Bayesian Logistic Regression puts a prior distribution on the weights of Logistic Regression and gradually updates the weights using the incoming data stream, which makes it highly suited to online stream learning. Finally we draw a sample (using Thompson Sampling) from the weights’ posterior distributions and use these weights to predict the performance of a segment. I have used Bayesian-Bandit-based Thompson Sampling techniques to design an RTB optimal bid price detection algorithm, and hence I am well acquainted with the underlying statistics.

Graepel et al. [3] give a detailed account of how this method is nearly equivalent to Logistic Regression trained with Stochastic Gradient Descent.

  •  With Gaussian Prior on weights Bayesian Logistic Regression is equivalent to Logistic Regression trained with Stochastic Gradient Descent with L2 regularization.
  • With Laplacian Prior on weights Bayesian Logistic Regression is equivalent to Logistic Regression trained with Stochastic Gradient Descent with L1 regularization.

In the initial version I will not be using Bayesian Logistic Regression (BLR). Instead I will go with Logistic Regression (LR) trained with Stochastic/Online Gradient Descent (SGD). Research has shown that LR trained with SGD inherently leads to exploration, since the gradient does not find the optimal weights instantaneously (convergence time is longer); also, due to anomalies in the data, the gradient deviates fairly often, which makes this algorithm inherently suited to exploration. So sometimes we will predict sub-optimally, i.e. after hourly updates our algorithm will predict inaccurate CTR for some segments, but this can act as a blessing in disguise since it will let us explore more unseen segments, which will be part of the training data set in the next training iteration.
Research [1,3,4] has shown that these algorithms perform on nearly the same scale. But in later versions I would definitely love to try BLR as it has a more robust update rule.

Model Training And Tuning Procedure
This will be the most critical and time-consuming step. Once done, we will just hard-code the various tuned model parameters and generated features in the model generation job. I have a few ideas to generate more useful features, like forming clusters of existing users w.r.t. their behavior, forming clusters of different advertiser ids, and including the respective cluster ids in the feature vector. This will act as a hierarchical feature based on historical behavioral data and may lead to increased performance. But all this experimentation will be constrained by the project timeline.
For modeling I will be following the main findings and steps defined in the four landmark papers on CTR prediction by Facebook, Google, Microsoft and Criteo, together with the results and findings from the two Kaggle competitions on CTR prediction by Criteo and Avazu:

1. Practical Lessons from Predicting Clicks on Ads at Facebook, Xinran He et al., ADKDD’14.
2. Ad Click Prediction: a View from the Trenches, McMahan et al., KDD’13.
3. Web-Scale Bayesian Click-Through Rate Prediction for Sponsored Search Advertising in Microsoft’s Bing Search Engine, Graepel et al., Proceedings of the 27th International Conference on Machine Learning.
4. Simple and scalable response prediction for display advertising, Chapelle et al., YYYY ACM.
5. http://mlwave.com/predicting-click-through-rates-with-online-machine-learning/
6. https://github.com/songgc/display-advertising-challenge
7. https://www.kaggle.com/c/criteo-display-ad-challenge/forums/t/10322/beat-the-benchmark-with-less-then-200mb-of-memory/53674

Main Modeling Steps:

  1. Feature Generation: Try quadratic, cubic and different combinations of features. Check the performance and use the best combination. Continuous features can be binned, and we can treat the bin index as a feature.
  2. Historical Features: Generate the following historical features to see if they improve performance
    a. Counting Features:
      i. Device_IP count
      ii. device id count
      iii. hourly user count
      iv. user count
      v. hourly impression count
      vi. #impressions to the user in a given day
      vii. #impressions to user per pub or app id
      viii. time interval since last visit
      ix. #impressions to user per pub or app id in last hour
      x. features that occurred less than 10 times will be converted to a “rare” category
    b. Bag Features:
      i. user, pub id or app id, bag of app ids
    c. Click History:
      i. #clicks (overall) by a user
      ii. #clicks (overall) per pub/app id
  3. Feature Hashing:
    The issue with the dummy coding presented above is that the dimensionality c can get very large when there are variables of high cardinality. The idea is to use a hash function to reduce the number of values a feature can take. We still make use of the dummy coding described in the previous section, but instead of a c-dimensional code, we end up with a d-dimensional one, where d is the number of bins used with hashing. If d < c, this results in a compressed representation; in that case collisions are bound to occur but, as explained later, this is not a major concern.
    When dealing with several features, there are two possible strategies:
    (1) Hash each feature f into a d_f-dimensional space and concatenate the codes, resulting in Σ_f d_f dimensions.
    (2) Hash all features into the same space, using a different hash function for each feature.
    We use the latter approach as it is easier to implement. The tuning procedure will consist of finding the optimal size d of the reduced space.
  4. Tuning Gradient Boosting Machines: This step involves training GBMs on the data set and generating features from them. We treat each individual tree as a categorical feature that takes as value the index of the leaf an instance ends up falling in. The final representation of the input feature vector will be in binary coded format. Tuning the GBMs involves:
    • shrinkage
    • number of trees
    • interaction depth
    Our goal is to find the optimal value of all these parameters via k-fold cross validation.
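The shared-space hashing of step 3, strategy (2), can be sketched as follows (md5 is used here only to get a deterministic hash; mixing the feature name into the hashed string plays the role of a per-feature hash function):

```python
import hashlib

def hash_features(pairs, d):
    """Hash (feature, value) pairs into one shared d-dimensional binary vector."""
    v = [0] * d
    for name, value in pairs:
        digest = hashlib.md5(f"{name}={value}".encode()).hexdigest()
        v[int(digest, 16) % d] = 1   # collisions are possible but tolerable
    return v

vec = hash_features([("device", "iphone"), ("domain", "news.com")], d=32)
```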

5. Learning Rate Parameters of Stochastic Gradient Descent:

Per-coordinate learning rates have been shown to achieve the best accuracy. The learning rate for feature i at iteration t is set to

η(t,i) = α / ( β + √( Σ_{j=1..t} g(j,i)² ) )

where g(j,i) is the gradient w.r.t. weight i at iteration j, and α, β are tuning constants.
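A minimal sketch of such an update for a sparse gradient (the dict-based representation and the constants α, β are illustrative):

```python
import math

def percoord_update(w, grad, grad_sq, alpha=0.1, beta=1.0):
    """One SGD step where coordinate i uses its own rate
    eta(t, i) = alpha / (beta + sqrt(sum of squared gradients seen for i))."""
    for i, g in grad.items():
        grad_sq[i] = grad_sq.get(i, 0.0) + g * g
        eta = alpha / (beta + math.sqrt(grad_sq[i]))
        w[i] = w.get(i, 0.0) - eta * g
```

Coordinates that receive large or frequent gradients get their rate shrunk faster, while rarely seen sparse features keep a larger rate.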

6. Learning weights and regularization parameter of Logistic Regression using cross entropy cost function
* To Do: I have to study and test Follow The Regularized Leader (FTRL); it was recently published by Google [2] and is claimed to perform a bit better than L1 regularization on sparse data.
7. Finding the optimal negative sampling rate: Since we have a class imbalance problem, we need to find a subsampling rate for the negative (no click) instances. We need to experiment with different negative downsampling rates to test the prediction accuracy of the learned model. We can vary the rate in {0.1, 0.01, 0.001, 0.0001} and check the effect on the final cross entropy achieved.
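One consequence of negative downsampling is that predicted probabilities come out inflated and must be re-calibrated before serving. A sketch of the standard odds correction (here w is the fraction of negatives kept):

```python
def downsampled_prob(p, w):
    """What a perfectly calibrated model would predict after keeping a fraction w of negatives."""
    odds = (p / (1.0 - p)) / w      # dropping negatives inflates the odds by 1/w
    return odds / (1.0 + odds)

def calibrate(q, w):
    """Invert the shift: map a prediction q made on downsampled data back to the original scale."""
    return w * q / (w * q + 1.0 - q)
```

For example, with a true CTR of 0.001 and w = 0.01, the model sees roughly a 9% positive rate, and calibrate() maps that back to 0.001.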

Infrastructure And Data Pipeline
I will have to discuss more about this section with you in person. Putting it broadly (a rough sketch), we will need the following batch jobs:
1. Hourly Spark job to process the RTB logs (read logs from HDFS and create training segments).
2. Hourly Spark job for preprocessing, feature and training data generation (one-hot encoded vectors).
3. Daily Spark job to generate the Gradient Boosted Machines model (if training on the Spark cluster is not computationally expensive, maybe we can train this model much more often).
4. Hourly Spark job to generate the Logistic Regression model.
5. Job to deploy the updated model into production.
6. Daily job to generate a generalized model to handle new advertisers/creatives.


Of Bandits And Bidding

Real-time bidding (RTB) refers to the buying and selling of online ad impressions through real-time auctions that occur in the time it takes a webpage to load. Those auctions are often facilitated by ad exchanges or supply side platforms (SSPs). An RTB agent has a set of active ad campaigns, each with its own targeting criteria. So when an ad request is broadcast by an ad exchange or SSP, each RTB agent has to take two important decisions: first, given the segment (a segment is just the unique set of features of the incoming request, e.g. geo location, content category, ad slot size, ad slot position etc.), which campaign to serve from the list of active campaigns; and second, how much to bid.

The first problem is dealt with by smooth pacing algorithms. These involve allocating/spreading out the daily budget of a campaign throughout (if possible uniformly across) the 24-hour timeline. The main aim of a pacing algorithm is to allocate the budget of a campaign to different time periods in such a way that the campaign is able to exhaust its daily budget while impression delivery is not skewed or biased. In this article I will not discuss the details of pacing algorithms; rather I will dig into the intricacies of the bidding algorithm, in particular how Bayesian Bandits can be used to determine the optimal bidding price in real time.

Under The Analytical Hood 

We start with the assumption that given a segment S (impression request), N RTBs are actively bidding on it. We do not know the number N, and it is a dynamic parameter: it may happen that the campaign running on one of the competitor RTBs targeting the concerned segment ends, leaving us with N-1 bidders; on the other hand, new bidders can join the bidding process, thereby increasing the number of active bidders.

Furthermore, an impression has a different relative significance for each RTB agent. Let’s say some RTB has an active campaign that only targets impressions belonging to one particular segment, and this campaign has a high CPM and a huge daily budget. This RTB will bid higher than the rest of the bidders when an impression belonging to that particular segment is received.

Another interesting aspect that makes this problem more challenging is that if an RTB loses the bid, it never gets to know the auction winning price.

  • The number of players in the game is dynamically changing
  • The bid response is binary: either 1 (you won the bid) or 0 (you lost the bid)
  • At a particular bid price for a segment we have an associated win rate distribution.
  • The payoff matrix is unknown, i.e. we don’t know how important that segment is to the competing RTBs.

We further assume that given an impression request, a pacing algorithm has already selected a campaign and the win rate to achieve. There will be a specific win rate associated with each campaign and segment pair. Since each segment has different available inventory, we can distribute our budget accordingly. So even though in the next thirty minutes we expect to receive 10000 impressions belonging to a particular customer segment, we needn’t win them all. Based on all active campaigns targeting the segment, their respective budgets, and the availability of all other segments that can be targeted, our pacing algorithm outputs a win rate value. Now we will try to analytically formulate a model for the bid price prediction problem. I will try to be as comprehensive as possible for the benefit of the uninitiated reader.

Symbols used and assumptions made : 

  • We denote the parameters of the i-th competing bidder for segment s by Bidder(i,s).
  • Each Bidder(i,s) is assumed to draw its bid price at time t from a Gaussian distribution with mean μ(i,s) and some constant standard deviation σ.
  • We assume the standard deviation σ is very small, but over time a bidder can change the mean of its bid price.
  • From a game-theoretic standpoint this is a game of incomplete information, i.e. each bidder only gets to know the winning price if it wins the bid, and never knows what other bidders are bidding. Every player has a different set of information regarding the payoffs.
  • The payoffs are continuously changing, i.e. the win rate at a constant bid amount (a fixed strategy) is changing.
  • Each bidder i bids on a proportion of the incoming requests from the SSP for a particular segment s; therefore even if some bidder bids way less, it will sometimes win, since it may happen that the RTBs who usually bid higher for segment s didn’t bid on those particular requests. Let’s denote the normalized bidding rate of player i on segment s as p(i,s). So p(i,s) also represents the probability that RTB i bids on segment s.
  • Since a bidder can increase or decrease its pacing rate depending on the time of day and the remaining budget of the campaign, p(i,s) is a dynamic parameter.

Win Price Distribution 

The Win Price distribution at time t will be a Gaussian Mixture Model with probability density function(pdf) as

pdf(x) = p(1,s)·N(x | μ(1,s), σ) + p(2,s)·N(x | μ(2,s), σ) + … + p(n,s)·N(x | μ(n,s), σ)

Here n refers to the total number of bidders and x is the bid price. The above distribution gives the probability density of the winning bid price.

So given a bid price x, the above pdf gives the probability density of winning the bid at that price.
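One way to make this operational (a sketch; it treats a win as "every competitor either abstains or bids below us", matching the worked examples later in this post) is:

```python
import math

def norm_cdf(x, mu, sigma):
    """Standard Gaussian CDF via the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def win_prob(bid, bidders):
    """bidders: list of (bid_probability, mean_price, std).
    We win when each competitor either doesn't bid or bids below our price."""
    prob = 1.0
    for p, mu, sigma in bidders:
        prob *= (1.0 - p) + p * norm_cdf(bid, mu, sigma)
    return prob

# with the three hypothetical bidders from Scenario 1 below, bidding $3.5 wins ~35% of auctions
scenario = [(0.7, 3.0, 0.1), (0.5, 4.0, 0.1), (0.3, 5.0, 0.1)]
```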

To solve the above problem, i.e. to derive estimates of all the distribution parameters p(i,s), μ(i,s) and σ, we need to generate a sample (points) from the distribution and then estimate the parameters using Expectation Maximization. In the figure below, the distribution is a sum of two Gaussian distributions. We can imagine that there are two bidders, each with a bid price drawn from one Gaussian distribution, and each bidding only on a proportion of requests. The one with the higher bid price always wins the bids it makes, while the other only wins a proportion of its bids, i.e. when the former bidder doesn’t bid. So the overall distribution of winning price will look as follows

RTB1 = Green

RTB2 = Red

RTB1 has a higher mean bidding price, around $18 CPM, but lower variance, whereas RTB2 has a lower mean bidding price of approximately $11 CPM but higher variance. Now, given a bid price, we can estimate the win rate over the long run.

* So sometimes RTB2 wins because it bids more than RTB1, while other times it wins because RTB1 didn’t bid on those requests.


Now, to estimate all the unknown parameters and draw this distribution, we would need a sample of data points from it; in our case these are the winning bid prices over a significant interval of time.

  • In our case we never get to know the winning bid price if we lose the bid. Even if we win the bid, many exchanges don’t reveal the second highest price, since they don’t use Vickrey auctions.
  • To use EM (Expectation Maximization) we should know how many RTBs are taking part in the auction, but we have no way of knowing the number ‘n’ used in the pdf.
  • Another important issue is the temporal aspect of the distribution. It may happen that RTB1 or RTB2 stops bidding on the segment (maybe their respective campaigns targeting the segment finish), or they increase their bidding rate, resulting in a major change in the shape of the distribution. This kind of behavior is very much expected in an RTB environment.
  • So basically we can’t analytically solve the Gaussian Mixture Model using Expectation Maximization.

Reinforcement Learning : Explore And Exploit 

To solve the above-defined mixture model we will use a variant of reinforcement learning, which works by allocating some fixed amount of resources to exploring for the optimal solution and spending the rest on the best option. After spending due time on various explore-and-exploit techniques like Multi-Armed Bandits, Bayesian Bandits etc., I settled on using the Boltzmann Exploration / Simulated Annealing technique.


Bidding Algorithm : Bayesian Bandits

We will use a Bayesian Bandit approach with Thompson sampling for selecting the optimal bandit [5]. Though Bayesian bandits are mostly used in A/B testing and CTR-based ad selection, according to our research their usage suits the requirements and dynamics of the RTB environment. In [6] the authors compare the performance of Bayesian Bandit techniques with other bandit algorithms like UCB, epsilon-greedy etc., with the aim of selecting the best ad to deliver to maximize CTR. All the referenced research shows that the Bayesian Bandit method outperforms other bandit techniques w.r.t. total regret, proportion of best arm played, and convergence time, and readily converges to the best choice/action in the explore-exploit dilemma.

Please go through references 3, 5 and 6 for a comprehensive look into bandit based techniques. Henceforth I will explain the application of the Bayesian Bandit method to RTB bidding, skipping the underlying theoretical details.

As described in the earlier section (titled win price distribution), the joint distribution of win rate and bid price can be a mixture distribution with latent parameters. Since we will never know the winning price of a lost bid, we can't solve this optimization problem in a closed-form analytical way. Therefore we will try to discover the bid price that yields the desired win rate by exploring the bid price space.

Let's say a campaign has a CPM of $5. We will divide the bid price space into bins as follows: [$1, $1.5, $2, $2.5, $3, $3.5, $4, $4.5]. Now we will explore these bins to discover the corresponding win rates.

Experiment Scenario :  

Scenario 1:

RTBs: We have 3 hypothetical RTBs bidding on a particular segment, each generating its bid price via a Normal Distribution with the following parameters:

RTB1: mean bid price = $3, standard deviation = $0.1, proportion of bids placed on incoming requests = 70%, i.e. on average this RTB bids on 70% of all requests sent by the exchange

RTB2: mean bid price = $4, standard deviation = $0.1, proportion of bids placed on incoming requests = 50%

RTB3: mean bid price = $5, standard deviation = $0.1, proportion of bids placed on incoming requests = 30%

Target Win Rate required by our test campaign = 40%

CPM = $5

Given the list of bid price bins [$1, $1.5, $2, $2.5, $3, $3.5, $4, $4.5], the aim of the algorithm is to find the bin whose win rate is closest to 40%.

Examples : 

  • Now if we bid at $3.5 CPM then the probability of winning = (probability that RTB3 doesn't bid on that request) × (probability that RTB2 doesn't bid on that request)
    = (1 − 0.3) × (1 − 0.5) = 0.35
    i.e. if we bid $3.5 we will win 35% of the time.
  • On the other hand, if we bid $4 CPM then our probability of winning = (probability that RTB3 doesn't bid) × (probability that RTB2 bids, but below $4) + (probability that RTB3 doesn't bid) × (probability that RTB2 doesn't bid)
    = (1 − 0.3) × 0.5 × 0.5 + (1 − 0.3) × (1 − 0.5)
    = 0.175 + 0.35 = 0.525
    i.e. if we bid at $4 we will win 52.5% of the time.
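The arithmetic above generalizes: we win a bid if and only if every competing RTB either abstains or bids below our price. A minimal sketch in Python, using the hypothetical Scenario 1 parameters:

```python
import math

def normal_cdf(x, mean, std):
    # P(X <= x) for X ~ N(mean, std^2), via the error function
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))

# (mean bid, std dev, proportion of requests bid on) for RTB1..RTB3
RTBS = [(3.0, 0.1, 0.7), (4.0, 0.1, 0.5), (5.0, 0.1, 0.3)]

def win_probability(bid, rtbs=RTBS):
    """We win iff every RTB either doesn't bid or bids below our price."""
    p = 1.0
    for mean, std, prop in rtbs:
        p *= (1.0 - prop) + prop * normal_cdf(bid, mean, std)
    return p

print(round(win_probability(3.5), 3))  # 0.35, matching the first example
print(round(win_probability(4.0), 3))  # 0.525, matching the second example
```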

Since the required win rate is 40% and a bid price of $3.5 gives us the win rate closest to the target, the algorithm should place most of its bids at $3.5.

Experiment Results: 

I simulated the above scenario for 400 trials. At each request the algorithm selects a bid bin/bucket based on its current assessment of that bin's win rate: the closer a bin's win rate is to the target win rate, the more bids are made at that price. After each bid the algorithm updates the corresponding winning probability. After some trials the algorithm grows more and more confident about its assessment of the win rate of each bid price and finally selects the best bid price for all further bids.
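The selection loop can be sketched with Beta posteriors and Thompson sampling. This is my sketch of the approach described above, not the post's code: the true win rates are precomputed from the Scenario 1 RTB parameters, and at each step we bid in the bin whose sampled win rate is closest to the target.

```python
import random

BINS = [1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5]
# true win rate of each bin under the Scenario 1 RTBs (precomputed)
TRUE_WIN = [0.105, 0.105, 0.105, 0.105, 0.2275, 0.35, 0.525, 0.7]
TARGET = 0.40

def run(trials=2000, seed=7):
    random.seed(seed)
    alpha = [1.0] * len(BINS)   # Beta posterior: 1 + #wins per bin
    beta = [1.0] * len(BINS)    # Beta posterior: 1 + #losses per bin
    pulls = [0] * len(BINS)
    for _ in range(trials):
        # Thompson sampling: draw a win-rate sample per bin and bid
        # in the bin whose sample is closest to the target win rate
        samples = [random.betavariate(alpha[i], beta[i]) for i in range(len(BINS))]
        i = min(range(len(BINS)), key=lambda j: abs(samples[j] - TARGET))
        pulls[i] += 1
        won = random.random() < TRUE_WIN[i]   # outcome of this bid
        if won:
            alpha[i] += 1                     # update the posterior
        else:
            beta[i] += 1
    return pulls

pulls = run()
best = BINS[max(range(len(BINS)), key=lambda i: pulls[i])]
print(best)  # the $3.5 bin collects the most bids, as in Figure 1.3
```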

Figure 1.1

Figure 1.2

Figure 1.3

Figure 1.1 shows that the initial runs (around 100 bids) were needed for the algorithm to gather enough information to make intelligent guesses about the win rate of each bid price. Once the exploration phase was over, the algorithm selected $3.5 CPM as the bid price over and over. Figure 1.3 depicts that out of 400 bids around 225 were made at $3.5. The interesting aspect of the Bayesian bandit is that once the algorithm had enough confidence in its estimates of the winning probability for each bid price, it selected $3.5 almost every time (after bid #150).

Issues with the Algorithm: 

At each bid the algorithm updates the posterior distribution of the win probability of the corresponding bid price. As more and more bids take place and the algorithm's Beta posterior converges, the posterior variance becomes low and accurate predictions of the win rate are made. But the bidding environment is dynamic: a new RTB can enter or an old RTB can leave the bidding process, changing the winning probability at a particular bid price. Our algorithm will not be able to adapt to this change, because the posterior variance is too low and we have grown too confident in our estimates.

Scenario 2: A new RTB enters the bidding process after 300 trials. By this time our algorithm has drawn win-rate estimates for each bid price and has grown pretty confident about them, but this new RTB will change the real win rate at the $3.5 CPM bid, which will no longer be the optimal choice for us.

RTB4: mean bid price = $3.5, standard deviation = $0.01, proportion of bids placed on incoming requests = 90%

Now if we bid at $3.5 our win probability will be = (1 − 0.3) × (1 − 0.5) × (1 − 0.9) = 0.035, i.e. 3.5%


Figure 2.1


Figure 2.2


Figure 2.3

Figure 2.3 shows that as soon as RTB4 enters the bidding process we stop winning at the $3.5 price. But since the algorithm keeps believing that the win rate at this bid price is optimal (think of it in terms of inertia or momentum), it keeps bidding at $3.5. With time it loses this inertia and starts selecting $4 CPM (after iteration 400). Overall, though, the algorithm takes too much time to adapt to the new scenario.

Restless Bandit And Variance Inflation 

To tackle the issue of the dynamic environment described in the last section, I experimented with a technique called variance inflation. As we saw in the last experiment, the Bayesian Bandit technique was able to continuously pick the best bid price after only around 100 overall bids. As we select the optimal bid price more often, the variance of the posterior Beta decreases: the algorithm is learning continuously and becoming more and more confident about its choice. But as soon as the environment changed, e.g. a new RTB bidding more than our optimal bid price entered the environment, we started losing all the bids, and it took around 200 lost bids to learn that the current bid price was stale and some other bid price should be selected. In a way, only after losing hundreds of bids did our algorithm lose confidence (the distribution changed) in the bid price it thought was optimal.

To tackle this problem I started increasing the variance of the win rate distribution associated with each bid price by 10% as soon as we had made 50 bids at that price. This means that as soon as our algorithm starts selecting one particular bid price more often for a segment, we decrease its confidence by 10% and ask it to explore other bid prices. If the environment doesn't change then the exploration wastes some resources, since we will bid either higher or lower than the optimal price; but if the environment changes, i.e. some new RTB enters or leaves the bidding environment or some existing RTB starts bidding higher or lower, our algorithm will grasp this change.
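One way to implement this inflation on a Beta posterior is to shrink both shape parameters by a common factor: the posterior mean is preserved while the variance grows, which pushes the bandit back into exploration. A sketch (the 0.9 shrink factor standing in for the 10% confidence cut is my choice, not a value from the post):

```python
def beta_mean_var(alpha, beta):
    """Mean and variance of a Beta(alpha, beta) distribution."""
    s = alpha + beta
    return alpha / s, alpha * beta / (s * s * (s + 1.0))

def inflate(alpha, beta, factor=0.9):
    """Shrink both shape parameters: mean preserved, variance inflated."""
    return alpha * factor, beta * factor

# A confident posterior after many bids at one price...
a, b = 40.0, 60.0
m1, v1 = beta_mean_var(a, b)
# ...after inflation the mean is unchanged but the variance is larger,
# so Thompson sampling will occasionally try other bid prices again
a2, b2 = inflate(a, b)
m2, v2 = beta_mean_var(a2, b2)
print(abs(m1 - m2) < 1e-12, v2 > v1)  # True True
```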

Simulation Scenario:

I made the appropriate implementation changes and created a scenario similar to the one defined in the last section. We have 3 RTBs bidding for a segment with the following configuration:

RTB1: mean bid price = $3, standard deviation = $0.1, proportion of bids placed on incoming requests = 70%, i.e. on average this RTB bids on 70% of all requests sent by the exchange

RTB2: mean bid price = $4, standard deviation = $0.1, proportion of bids placed on incoming requests = 50%

RTB3: mean bid price = $5, standard deviation = $0.1, proportion of bids placed on incoming requests = 30%

Target Win Rate required by our test campaign = 40%

CPM = $5

Given the list of bid price bins [$1, $1.5, $2, $2.5, $3, $3.5, $4, $4.5], the aim of the algorithm is to find the bin whose win rate is closest to 40%.

Objective: The bidding algorithm will find the optimal bid price, i.e. $3.5, after some trials. From this point onwards our algorithm will bid on most requests at $3.5. But after 300 bids a new RTB, RTB4, will enter the bidding domain and start bidding a little more than us, i.e. $3.6. We will see, compared to the results posted in the last section, how fast our new algorithm adapts to this change and relearns the optimal bidding price.



Figure 3.1

Figure 3.1 shows the overall bids made on the X axis and #bids made at each bid price on the Y axis. We can see that before bid #300 the optimal bid price was $3.5 and it was used most of the time. At trial #300 a new RTB enters the bidding environment. From the plot we can see that it took the algorithm only 20-30 trials to find that the optimal bid price had changed from $3.5 to $4. Henceforth the $4 bid price was selected most often.


Figure 3.2


Figure 3.3

Figure 3.3 shows that the win rate for bid price $3.5 drops to nearly zero after trial #300, since RTB4 was bidding at a higher price.


  1. Bid Landscape Forecasting in Online Ad Exchange Marketplace
  2. Real Time Bid Optimization with Smooth Budget Delivery in Online Advertising
  3. http://nbviewer.ipython.org/github/msdels/Presentations/blob/master/Multi-Armed%20Bandit%20Presentation.ipynb
  4. http://www.robots.ox.ac.uk/~parg/pubs/theses/MinLee_thesis.pdf
  6. Learning Techniques In Multi Armed Bandits 

Proposal : An Integrated Zero Knowledge Authentication and Public key verification protocol using Gap Diffie Hellman Groups to counter MITM attack

Introduction: The basic idea behind the proposal is to integrate authentication and key distribution protocols. Immense work has been done to design secure key distribution protocols, starting with Needham-Schroeder and evolving through OAKLEY, SKEME and IKE, and ultimately to the Diffie-Hellman protocol and PKI. Though much work has been done in the fields of data integrity, non-repudiation and secrecy, password based authentication is still the Achilles heel of cryptography. In the presence of Byzantine adversaries like a Man In The Middle (MITM), even a combination of multi-factor authentication and PKI fails.

The Gap-DH problem

Okamoto and Pointcheval formalized the gap between the inverting and decisional problems. In particular, they applied it to the Diffie-Hellman problems:

– The Inverting Diffie-Hellman Problem (C-DH) (a.k.a. the Computational Diffie-Hellman Problem): given a triple of G elements (g, g^a, g^b), find the element C = g^ab.
– The Decision Diffie-Hellman Problem (D-DH): given a quadruple of G elements (g, g^a, g^b, g^c), decide whether c = ab mod q or not.
– The Gap Diffie-Hellman Problem (G-DH): given a triple (g, g^a, g^b), find the element C = g^ab with the help of a Decision Diffie-Hellman Oracle (which answers whether a given quadruple is a Diffie-Hellman quadruple or not).

Bilinear maps and pairings 

The bilinear maps used in cryptography are the Weil and Tate pairings on some elliptic curves.

Let G and G1 be two groups of order some large prime q, denoted multiplicatively. An admissible bilinear map is a map e : G × G -> G1 that is:

– bilinear: e(g^a, h^b) = e(g, h)^ab for all g, h ∈ G and all a, b ∈ Z,
– non-degenerate: for g and h two generators of G, we have e(g, h) ≠ 1,
– computable: there exists an efficient algorithm to compute e(g, h) for any g, h ∈ G.

In the case of the Weil and Tate pairings, G is a subgroup of the additive group of points of an elliptic curve E/Fp, and G1 is the multiplicative group of the extension field F_(p^2). However, in order to remain close to the identification scheme of Schnorr, we choose to keep the multiplicative notation both for G and G1.

Proposed Protocol 

From now on, G is a group in which the Gap Diffie-Hellman problem is intractable. We assume the existence of an admissible bilinear map e : G × G -> G1, where G and G1 are of order q, and we denote by g a generator of G.

The prover holds public parameters

g, g^a, g^n, e(g, g), v = e(g, g)^a

and a secret S = g^a. Here the public key is g^n, where n is the corresponding private key. The value v is given in the public parameters in order to spare the verifier its computation. The scheme we propose is a zero-knowledge proof of knowledge of the value g^a obtained by iterating the algorithm l times. Along with the proof of knowledge, the public key of the prover is embedded into the proof, so that while verifying knowledge of the secret the verifier also enters the prover's public key as a parameter and verifies the authenticity of the secret as well as of the key. So from now onwards, a Man In The Middle cannot just forward the prover's proof together with his own public key to establish a secure channel. The MITM will be unable to separate the key part from the proof of the secret, and thus will be incapable of carrying out attacks like active phishing, where he first forwards his public key to the server and then the user's credentials. Using the proposed protocol, the server will detect that the credentials forwarded do not belong to the public key supplied earlier to establish the secure channel.

Step 1: The Prover chooses a random number r in the prescribed range and computes W (a mapping in G1). Then we introduce the public key into the picture and compute the mapping of the public key using the random number r; 'r' is known only to the prover and is created for obfuscation. The result V is sent to the verifier.

Step 2: The Verifier computes another random number c and sends it to Prover.

Step 3: The Prover uses c to compute a product Y in G, which comprises the secret, the unknown r, and the public key raised to the power r.

Step 4: The Verifier multiplies the resultant Y with the known public key of the prover and computes its mapping. If the mapping is equivalent to the product of the mapping received in Step 1 and the mapping of the secret taken c times, the proof is accepted.

To prove the security of the scheme under the G-DH problem, we follow the outline of general zero-knowledge proofs. Namely, we prove the completeness and the soundness of the scheme, and finally that the protocol is zero-knowledge. During the proof, the prover and the verifier are modeled by Probabilistic Polynomial time Turing Machines (PPTM).


If the legitimate prover and the verifier both follow the scheme, then it always succeeds. Indeed, at each iteration the probability of success is equal to 1, since:

e(g, Y) = e(g, g^r × g^ac × g^nr) = e(g, g)^(r+ac+nr) = e(g, g)^r × e(g, g)^ac × e(g, g)^nr = {e(g, g)^nr × e(g, g)^r} × e(g, g)^ac = V × v^c
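Working in the exponents (every element of G is g^x, every element of G1 is e(g, g)^x, and the pairing multiplies exponents mod q), the completeness identity can be checked numerically. This is only a toy arithmetic check of the equation, not an implementation of the pairing; the modulus and random values are illustrative:

```python
import random

q = 1000003  # a prime playing the role of the group order

random.seed(1)
a = random.randrange(1, q)   # secret exponent: S = g^a
n = random.randrange(1, q)   # private key; the public key is g^n
r = random.randrange(1, q)   # prover's blinding value
c = random.randrange(1, q)   # verifier's challenge

# Prover: V = e(g,g)^r * e(g^n, g)^r      -> exponent r + n*r
V = (r + n * r) % q
# Prover: Y = g^r * g^(a*c) * g^(n*r)     -> exponent r + a*c + n*r
Y = (r + a * c + n * r) % q
# Verifier: e(g, Y) should equal V * v^c, with v = e(g,g)^a
lhs = Y                      # exponent of e(g, Y)
rhs = (V + a * c) % q        # exponent of V * v^c
print(lhs == rhs)  # True: the verification equation always holds
```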

Zero Knowledge

The present identification scheme satisfies the zero-knowledge property. To simulate in polynomial time (in |I|) the communications between a real prover and a (not necessarily honest) verifier, we use the following algorithm M:

For the simulation of one round, M randomly picks c in [0, 2^k), randomly picks Y in G and computes W = e(g, Y) × v^(-c). M then sends W to the verifier, which answers c̃. If c = c̃ then the triple (W, c, Y) is kept, otherwise M restarts the round. On average, the equality c = c̃ holds after 2^k tests. To obtain l triples, M constructs l × 2^k triples. If l × 2^k is polynomial in |I|, then we obtain a polynomial time algorithm.


(under work )





In 1976 Diffie and Hellman [7] introduced the concept of public key cryptography. They revolutionized the world of cryptography by developing the key exchange system popularly known as Diffie-Hellman Key Exchange, which introduced the applicability of the discrete log problem to cryptography. The purpose of the algorithm is to enable two users to securely exchange a key, and it depends for its effectiveness on the difficulty of computing discrete logarithms. The discrete logarithm problem employed was defined explicitly as the problem of finding logarithms with respect to a generator in the multiplicative group of the integers modulo a prime, but this idea can be extended to other groups. Elliptic curves were introduced to cryptography in 1985 independently by Koblitz [5] and Miller [6], who proposed using elliptic curves as the groups over which all calculations are performed. This generates more complex discrete log problems.

Other public key cryptographic systems like RSA rely on the difficulty of integer factorization. In [1] the authors discuss the issues with RSA: 1024-bit keys may be breakable in the near future. The key length for secure RSA use has increased over the years due to the development of fast computing processors which aid brute force computation, and larger key lengths put a heavier processing load on applications using RSA. [1] also discusses the advantages of ECC over RSA: there is a sub-exponential attack already developed for RSA, whereas ECC has attacks developed only for a few classes of curves.

Jurisic and Menezes [2] make a comparison based study between RSA and ECC. They compared the security achieved by both methods versus the key length used, and the results showed that ECC outperforms RSA. Table 1 from [1] shows the comparison between RSA and ECC on the basis of key size used and complexity achieved.

RSA key size (bits)    ECC key size (bits)    MIPS years

512                    106                    10^4

768                    132                    10^8

1024                   160                    10^11

2048                   210                    10^20

Here key sizes are in bits, and a MIPS year represents the computational power of a computer executing a million instructions per second, used for one year.

[3] describes the advantage that elliptic curve systems have over systems based on the multiplicative group of a finite field (and also over systems based on the intractability of integer factorization): the absence of a sub-exponential-time algorithm (such as those of "index-calculus" type) that could find discrete logs in these groups. Compared to RSA, ECC offers equal security for smaller key sizes. The result is smaller keys, bandwidth savings, and faster implementations, features which are especially attractive for security applications where computational power and integrated circuit space are limited, such as smart cards, PC (personal computer) cards, and wireless devices. [4] also discusses various motivating factors describing the advantages of ECC over other cryptosystems.


Diffie and Hellman [7] used the discrete logarithm problem in their key exchange protocol. They define the problem as finding logarithms with respect to a generator in the multiplicative group of the integers modulo a prime. But the problem of finding discrete logarithms can be extended to other groups; in particular, when it is defined over elliptic curves its computational complexity increases many fold. [3] explains the discrete logarithm problem as follows: let G be a finite group of order n, and let α be an element of G. The discrete logarithm problem for G is: given an element β ∈ G, find an integer x, 0 ≤ x ≤ n − 1, such that α^x = β. Various groups have been proposed over the years for cryptographic purposes: Agnew et al. proposed multiplicative groups of characteristic-two finite fields, while Koblitz [5] and Miller [6] used the group of points on an elliptic curve defined over a finite field.


An elliptic curve used for cryptographic purposes is defined as follows:

y^2 mod p = (x^3 + ax + b) mod p    (1)

where a and b are integer constants. The set E(a, b) is the set of all points (x, y) satisfying the above equation.

For an elliptic curve over a finite field Z_p, we use the above cubic equation in which the variables and coefficients all take on values in the set of integers from 0 to p − 1 for some prime p, and calculations are performed modulo p.

Assume first that Fq has characteristic greater than 3. An elliptic curve E over Fq is the set of all solutions (x, y) ∈ Fq × Fq to an equation

y^2 = x^3 + ax + b,

where a, b ∈ Fq and 4a^3 + 27b^2 ≠ 0, together with a special point ∞ called the point at infinity.

Addition Formulas for the Curve (1). Let P = (x1, y1) ∈ E; then −P = (x1, −y1). If Q = (x2, y2) ∈ E, Q ≠ −P, then P + Q = (x3, y3), where

x3 = λ^2 − x1 − x2

y3 = λ(x1 − x3) − y1,

λ = (y2 − y1) / (x2 − x1)   if P ≠ Q

λ = (3x1^2 + a) / (2y1)     if P = Q
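The addition formulas translate directly into code; over F_p the division by (x2 − x1) or 2y1 becomes multiplication by a modular inverse. A sketch using the textbook toy curve y^2 = x^3 + 2x + 2 over F_17 (my example curve, not one from the paper):

```python
# Toy curve y^2 = x^3 + 2x + 2 over F_17; None represents the point at infinity
P_MOD, A = 17, 2

def point_add(P, Q):
    """Add two points using the chord-and-tangent formulas."""
    if P is None:
        return Q
    if Q is None:
        return P
    x1, y1 = P
    x2, y2 = Q
    if x1 == x2 and (y1 + y2) % P_MOD == 0:
        return None                      # P + (-P) = point at infinity
    if P == Q:
        lam = (3 * x1 * x1 + A) * pow(2 * y1, -1, P_MOD) % P_MOD
    else:
        lam = (y2 - y1) * pow(x2 - x1, -1, P_MOD) % P_MOD
    x3 = (lam * lam - x1 - x2) % P_MOD
    y3 = (lam * (x1 - x3) - y1) % P_MOD
    return (x3, y3)

def scalar_mult(k, P):
    """Compute k*P by repeated doubling and addition."""
    R = None
    while k:
        if k & 1:
            R = point_add(R, P)
        P = point_add(P, P)
        k >>= 1
    return R

G = (5, 1)                 # a generator of order 19 on this curve
print(point_add(G, G))     # (6, 3)
print(scalar_mult(19, G))  # None: 19*G is the point at infinity
```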



Let us assume that both users of the system have agreed on a generating point G on the curve. Both will select their private keys and keep them secret. Let the private key of user A be nA and that of user B be nB. Their respective public keys pA and pB are calculated as

pA = nA × G,   pB = nB × G
Now we will use the MAP2Group method to map the message to a point on the elliptic group: message m is converted to a point Pm. Now let k be a secret parameter. User A calculates k*pB and k*G (where G is the generating point) and sends the cipher text

Cm = {kG, Pm + k*pB}

The adversary can access G and pB as they are in the public domain, but given G and kG it is computationally hard to find k.

To decrypt, User B multiplies his private key with the hint kG and subtracts the result from the second argument: (Pm + k*pB) − nB*(kG) = Pm.
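The full encrypt/decrypt cycle can be sketched end-to-end. The toy curve y^2 = x^3 + 2x + 2 over F_17 with generator (5, 1) (a point of order 19) and the small keys are my illustrative parameters, not values from the paper:

```python
P_MOD, A = 17, 2  # toy curve y^2 = x^3 + 2x + 2 over F_17

def add(P, Q):
    """Group addition; None is the point at infinity."""
    if P is None: return Q
    if Q is None: return P
    x1, y1 = P; x2, y2 = Q
    if x1 == x2 and (y1 + y2) % P_MOD == 0:
        return None
    if P == Q:
        lam = (3 * x1 * x1 + A) * pow(2 * y1, -1, P_MOD) % P_MOD
    else:
        lam = (y2 - y1) * pow(x2 - x1, -1, P_MOD) % P_MOD
    x3 = (lam * lam - x1 - x2) % P_MOD
    return (x3, (lam * (x1 - x3) - y1) % P_MOD)

def mul(k, P):
    """k*P by double-and-add."""
    R = None
    while k:
        if k & 1: R = add(R, P)
        P = add(P, P); k >>= 1
    return R

def neg(P):
    return None if P is None else (P[0], (-P[1]) % P_MOD)

G = (5, 1)          # generator of order 19
nB = 7              # Bob's private key
pB = mul(nB, G)     # Bob's public key pB = nB*G

Pm = mul(3, G)      # message already mapped to a curve point
k = 4               # Alice's secret session parameter
cipher = (mul(k, G), add(Pm, mul(k, pB)))   # {kG, Pm + k*pB}

# Bob: Pm = (Pm + k*pB) - nB*(kG), since nB*(kG) = k*(nB*G) = k*pB
hint, body = cipher
recovered = add(body, neg(mul(nB, hint)))
print(recovered == Pm)  # True
```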


The total time taken is:

1. Time taken to map the message to the group: the Map2Group algorithm maps message M to Fp^m as (x, y). Its most expensive operation is the square root, so the time needed is O(m^3). If m = 1, i.e. Fp is the desired field as in our case, then the time taken is O(1).
2. Time taken to add two points over the elliptic curve is also a constant, i.e. O(1).
3. Time taken in finding the cipher text is O(k), where k is the security parameter. So the total time taken is O(k) + Θ(1).

Security of ECC

The basis for the security of elliptic curve cryptosystems such as the ECDSA is the apparent intractability of the following elliptic curve discrete logarithm problem (ECDLP): given an elliptic curve E defined over Fq, a point P ∈ E(Fq) of order n, and a point Q ∈ E(Fq), determine the integer l, 0 ≤ l ≤ n − 1, such that Q = lP, provided that such an integer exists. [3] gives a good discussion of the various methods developed to solve the ECDLP. Pohlig-Hellman [9] developed an algorithm that reduces the determination of l to the determination of l modulo each of the prime factors of n. The Pollard ρ-method [10] is one of the best known algorithms developed to attack the ECDLP, and Gallant, Lambert and Vanstone [11] made it more efficient; it takes about √(πn)/2 elliptic curve additions.

Semaev [12], Smart [13] and Satoh-Araki [14] designed an algorithm which efficiently computes an isomorphism between E(Fp), where E is a prime-field-anomalous curve, and the additive group of Fp. This gives a polynomial-time algorithm for the ECDLP in E(Fp). Menezes, Okamoto and Vanstone [15] developed the famous MOV attack: it uses the Weil pairing to transform the elliptic curve group into a multiplicative group defined over the field F_(q^k) for some integer k. Due to this transformation the elliptic curve discrete logarithm problem is reduced to the discrete logarithm problem (DLP) in F_(q^k). In [16] Miller discussed the implementation of the index calculus method in elliptic curve groups.

Proposed method to Increase the complexity of elliptic curve crypto system

In this paper we propose a new idea for enhancing the security of the elliptic curve cryptosystem by producing two discrete log problems, each of which has to be independently solved. In ECC the curve on which computation is performed is in the public domain, since it is beneficial to keep the algorithm and components public to avoid overhead. After doing some computation on this base curve, i.e. the curve available in the public domain, we rotate it about its axis by some appropriate value. This rotated curve is hidden from the adversary: to know it, the adversary has to solve an Elliptic Curve Discrete Logarithm Problem (ECDLP). After rotating the curve we map the public keys and the generating point (G) from the base curve to the rotated curve. This is performed by a mapping function that performs a one-one mapping between the two curves. All the desired group arithmetic, as performed conventionally, is then carried out on the rotated curve. The user receiving the encrypted message has to use his/her private key with a hint provided in the cipher text to find the position of the curve.

We are increasing the security by producing two discrete log problems using keys of the same size as used earlier, thereby enhancing the security level. The adversary, using any of the above explained sub-exponential time algorithms, has to apply them to two separate ECDLP problems. In the next section we explain the proposed technique.


An elliptic curve E over Fq is the set of all solutions (x, y) ∈ Fq × Fq to an equation

y^2 = x^3 + ax + b, (1)

where a, b ∈ Fq and 4a^3 + 27b^2 ≠ 0, together with a special point ∞ called the point at infinity. Let there be two users Alice and Bob having private and public key pairs (nA, pA) and (nB, pB) respectively. Suppose that E is an elliptic curve over Fq, and G (the generating point) is an agreed upon (and publicly known) point on the curve.


Alice will take a secret number k1 and multiply the generating point G by it as per the group laws defined above:

Q = k1 × G

Since Q lies on the curve, let its coordinates be (x', y').




Rotate the axis of the curve by θ .


Alice will map the public keys pA, pB and the generating point G to the rotated elliptic curve using the mapping function F. Let us assume the mapped images are pA', pB' and G'.

Step 5

Now Alice will select another secret parameter k2 and compute k2*G' and k2*pB'. Here * depicts the group multiplication operation.

Step 6

Alice will map the plain text to Pm using the Map2Group algorithm.

Cipher text generation

Now Alice sends the cipher text Cm to Bob as

Cipher text Cm = {k1G, k2G', Pm + k2*pB'}

Here k1 is the parameter which Alice chooses for rotation of the curve, G is the generating point used on the initially given curve E1 to find E2, k2 is the number of times the addition is performed, G' is the mapped generating point, and Pm is the plain text.


Bob will multiply his private key nB with the first hint to discover the new curve. From this data Bob calculates the angle by which the curve was rotated; all further calculations are performed on the rotated curve. Bob will then multiply his private key with the second argument, i.e. k2*G', and subtract the result from the third argument of the cipher text:

{Pm + k2*pB'} − {k2G'*nB} = {Pm + k2*pB'} − {k2*(G'*nB)} = {Pm + k2*pB'} − {k2*pB'} = Pm


The two curves obtained will be isomorphic images of each other. Denoting the isomorphism by ø : E1 -> E2, for any point G1 on E1 with image G2 = ø(G1) we have

ø(αG1) = α ø(G1) = αG2

Development of a one-one function

Let F be a one-to-one function mapping points from E1 to E2, i.e. F : E1 -> E2. Let there be a point (x, y) on E1. After rotating the curve by θ we obtain the curve E2. Let the image of (x, y) be the point on E2 with abscissa x' and ordinate y'. F maps between the two groups as

x' = x cos θ − y sin θ

y' = x sin θ + y cos θ
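The mapping F is an ordinary plane rotation, so a sketch is immediate (over the reals, as the formulas above are stated; how cos θ and sin θ are interpreted over a finite field is left open here):

```python
import math

def rotate(x, y, theta):
    """Map a point of E1 to its image on the rotated curve E2."""
    return (x * math.cos(theta) - y * math.sin(theta),
            x * math.sin(theta) + y * math.cos(theta))

def rotate_back(x, y, theta):
    """Inverse mapping F^-1: rotate by -theta."""
    return rotate(x, y, -theta)

xp, yp = rotate(1.0, 0.0, math.pi / 2)   # a quarter turn sends (1, 0) to (0, 1)
x, y = rotate_back(xp, yp, math.pi / 2)  # the inverse map recovers (1, 0)
print(abs(xp) < 1e-9 and abs(yp - 1.0) < 1e-9)  # True
print(abs(x - 1.0) < 1e-9 and abs(y) < 1e-9)    # True
```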

Enhancement in Security

Now the adversary has to solve two elliptic curve discrete log problems instead of one: first it is necessary for the adversary to find the rotated curve, and only then can he perform the desired computation on it. We are enhancing the security without increasing the key length. Analyzing security versus key length: if we keep the key length the same, the security is doubled. For example, the index calculus algorithm takes sub-exponential running time

exp((c + o(1)) (log q^k)^(1/3) (log log q^k)^(2/3))

where we assume that the elliptic curve is defined over the field F_(q^k). The index calculus algorithm now has to be applied twice to solve the two ECDLPs, and in this way the security is increased twofold.

Most algorithms solving the ECDLP exploit the properties and size of the field over which the elliptic curve is defined. As an example, we show the security level reached by incorporating the proposed method in comparison to the conventional ECC algorithm. One of the best known algorithms to solve the ECDLP is the Pollard ρ method [10]. Let us assume the curve E is defined over the field F_(2^k); we denote the field size parameter 2^k by n. The Pollard ρ method takes √(πn)/2 steps, where a step represents an elliptic curve addition. We denote the computing power of the computers solving the ECDLP via the Pollard ρ algorithm in MIPS, i.e. million instructions per second; a MIPS year is the computational power of a computer executing a million instructions per second utilized for one year.

Size of n (bits)    √(πn)/2    MIPS years      MIPS years (proposed method)

160                 2^80       8.5 × 10^11     1.7 × 10^12

186                 2^93       7 × 10^15       1.4 × 10^16

234                 2^117      1.2 × 10^23     2.4 × 10^23

Since our proposed method generates two separate ECDLP problems, the Pollard ρ method has to be applied to both problems independently.

Implementation efficiency

ECC has major applications in wireless communication. Wireless devices run on battery power, so the limited energy provided by the battery acts as a constraint on processing capability. Methods like RSA provide a high level of security but consume the power of the wireless devices at both ends, as both encryption and decryption take considerable time and energy. ECC overtakes RSA in this respect: we can use keys of smaller size than RSA and still provide the same level of security. ECC also takes less time to implement, as the arithmetic performed, i.e. point addition and point doubling over the elliptic curve, is relatively fast compared to arithmetic performed over other fields.

[3] made a comparison between the time taken by ECC arithmetic over an elliptic curve defined over a field Fq whose order is 160 bits, and DSA arithmetic performed over a field Fp with a 1024-bit modulus p. Since both provide the same level of security, we compare their implementation times. Multiplying two n-bit numbers takes n^2 bit operations, so a modular multiplication is (1024/160)^2 ≈ 41 times longer than a field multiplication.

  1. To calculate kP, where P ∈ E(Fq) and k is a 160-bit integer, we do repeated doubling and addition, which on average requires 160 elliptic curve doublings and 80 elliptic curve additions. An elliptic curve addition or doubling requires one field inversion and two field multiplications [17], and [18] and [19] showed that one field inversion is equivalent to three field multiplications. [3] showed that computing kP therefore requires the equivalent of 1200 field multiplications, or 1200/41 ≈ 29 1024-bit modular multiplications.
  2. To calculate a^k mod p, where k is a 160-bit random number and p is a 1024-bit prime (as performed in DSA), by repeated squaring and multiplying requires an average of 240 1024-bit modular multiplications [3].

This shows that arithmetic operations performed on an elliptic curve defined over Fq are nearly 8 times faster than the operations in case 2.

Let us assume a random number k of 32 bits. As explained earlier, to produce the cipher text

Cm = {kG, Pm + k*pB}

we have to make three calculations:

1. multiplication of G with k
2. multiplication of pB with k
3. one addition operation

Operations 1 and 2 require 240 field multiplications each, so together they take 480 field multiplications. As explained earlier, one addition is equivalent to five field multiplications. So the total time taken in producing the cipher text is equivalent to 485 field multiplications.

Our proposed method may seem more time consuming at first look, but if we select the numbers k1 and k2 prudently we can perform the above calculation in the same time. In our method the cipher text is

Cm = {k1G, k2G', Pm + k2*pB'}

Let us assume that k1 and k2 are 32-bit and 16-bit numbers respectively. The four operations performed are:

1. multiplication of k1 with G

2. multiplication of k2 with G'

3. multiplication of k2 with pB'

4. addition of Pm and k2*pB'

Operation 1 is equivalent to 240 field multiplications, while operations 2 and 3 are equivalent to 120 field multiplications each. The fourth operation, an addition, is equivalent to one field inversion and two field multiplications, which sums to five field multiplications. In this way the total time taken is equivalent to 485 field multiplications, exactly equal to the earlier case.
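The operation tally for both cipher texts reduces to a few lines of arithmetic. The sketch below (plain Python; the cost model and bit lengths are taken from the text, the helper name is mine) confirms that both schemes cost 485 field multiplications:

```python
def scalar_mult_cost(bits):
    """Average field-multiplication cost of kP for a `bits`-bit scalar:
    ~bits doublings + ~bits/2 additions, 5 field mults per EC operation."""
    return (bits + bits // 2) * 5

EC_ADD = 5  # one point addition = 1 inversion (≈3 mults) + 2 mults

# Conventional scheme, Cm = {kG, Pm + k*pB}, with a 32-bit k:
conventional = 2 * scalar_mult_cost(32) + EC_ADD

# Proposed scheme, Cm = {k1G, k2G', Pm + k2*pB'},
# with k1 a 32-bit and k2 a 16-bit number:
proposed = scalar_mult_cost(32) + 2 * scalar_mult_cost(16) + EC_ADD

print(conventional, proposed)  # → 485 485
```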

Thus our proposed method takes the same time to implement as the previous method, so it can still be used in wireless communication, while providing more security than the conventional elliptic curve cryptographic system.


In this paper we introduced a technique called Mutarotation and presented an algorithm to implement it. We compared our proposed method with the previously incorporated technique on the basis of security achieved and implementation time, and showed that our technique is more secure than the previous method while taking the same time to implement. It can therefore provide stronger security for wireless communication at no extra computational cost.


1. V. Kapoor and V. S. Abraham, Elliptic curve cryptography, Department of Computer Engineering, Delhi College of Engineering.

2. A. Jurisic and A. J. Menezes, Elliptic curves and cryptography, Dr. Dobb's Journal, 1997.

3. N. Koblitz, A. Menezes and S. Vanstone, The state of elliptic curve cryptography, University of Washington, Seattle, and University of Waterloo, Ontario, Canada.

4. B. A. LaMacchia and J. L. Manferdelli, New vistas in elliptic curve cryptography, Microsoft Corporation, Redmond, WA, USA.

5. N. Koblitz, Elliptic curve cryptosystems, Mathematics of Computation, Vol. 48 (1987) pp. 203–209.

6. V. Miller, Uses of elliptic curves in cryptography, Advances in Cryptology—CRYPTO '85, Lecture Notes in Computer Science, Springer-Verlag, 218 (1986) pp. 417–426.

7. W. Diffie and M. Hellman, New directions in cryptography, IEEE Transactions on Information Theory, Vol. 22 (1976) pp. 644–654.

8. G. Agnew, R. Mullin, I. Onyszchuk and S. Vanstone, An implementation for a fast public-key cryptosystem, Journal of Cryptology, Vol. 3 (1991) pp. 63–79.

9. S. Pohlig and M. Hellman, An improved algorithm for computing logarithms over GF(p) and its cryptographic significance, IEEE Transactions on Information Theory, Vol. 24 (1978) pp. 106–110.

10. J. Pollard, Monte Carlo methods for index computation mod p, Mathematics of Computation, Vol. 32 (1978) pp. 918–924.

11. R. Gallant, R. Lambert and S. Vanstone, Improving the parallelized Pollard lambda search on binary anomalous curves, to appear in Mathematics of Computation.

12. I. Semaev, Evaluation of discrete logarithms in a group of p-torsion points of an elliptic curve in characteristic p, Mathematics of Computation, Vol. 67 (1998) pp. 353–356.

13. N. Smart, The discrete logarithm problem on elliptic curves of trace one, to appear in Journal of Cryptology.

14. T. Satoh and K. Araki, Fermat quotients and the polynomial time discrete log algorithm for anomalous elliptic curves, Commentarii Mathematici Universitatis Sancti Pauli, Vol. 47 (1998) pp. 81–92.

15. A. Menezes, T. Okamoto and S. Vanstone, Reducing elliptic curve logarithms to logarithms in a finite field, IEEE Transactions on Information Theory, Vol. 39 (1993) pp. 1639–1646.

16. V. Miller, Uses of elliptic curves in cryptography, Advances in Cryptology—CRYPTO '85, Lecture Notes in Computer Science, Springer-Verlag, 218 (1986) pp. 417–426.

17. G. Agnew, R. Mullin, I. Onyszchuk and S. Vanstone, An implementation for a fast public-key cryptosystem, Journal of Cryptology.

18. R. Schroeppel, H. Orman, S. O'Malley and O. Spatscheck, Fast key exchange with elliptic curve systems, Advances in Cryptology—CRYPTO '95.

19. E. De Win, A. Bosselaers, S. Vandenberghe, P. De Gersem and J. Vandewalle, A fast software implementation for arithmetic operations in GF(2^n), Advances in Cryptology—ASIACRYPT '96.
