<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://jraymartinez.github.io/portfolio/feed.xml" rel="self" type="application/atom+xml" /><link href="https://jraymartinez.github.io/portfolio/" rel="alternate" type="text/html" /><updated>2025-10-04T19:40:53+00:00</updated><id>https://jraymartinez.github.io/portfolio/feed.xml</id><title type="html">John Ray Martinez</title><subtitle>Portfolio of John Ray Martinez, a data scientist. John is AWS Machine Learning Specialty certified with experience in NLP, recommender systems, and time series. He is a PhD Candidate researching multi-agent AI systems with a focus on uncertainty quantification, confidence-weighted prediction fusion, and LLM reliability for high-stakes applications. He is a recipient of the Outstanding Graduate Student Award, Drexel University, 2021. He is a Data Professional with 10+ years of software engineering experience and a teaching background in Physics.</subtitle><author><name>John Ray Martinez</name></author><entry><title type="html">LLM Agents Mastery</title><link href="https://jraymartinez.github.io/portfolio/certificates/ucberkeley-llmagents/" rel="alternate" type="text/html" title="LLM Agents Mastery" /><published>2025-02-06T08:50:00+00:00</published><updated>2025-02-06T08:50:00+00:00</updated><id>https://jraymartinez.github.io/portfolio/certificates/ucberkeley-llmagents</id><content type="html" xml:base="https://jraymartinez.github.io/portfolio/certificates/ucberkeley-llmagents/"><![CDATA[<p><a href="https://llmagents-learning.org/f24"><em>UC Berkeley RDI Certification</em></a> (2024).<br /></p>

<p><strong>Description</strong>. John has successfully completed and earned the LLM Agents Mastery certification from the UC Berkeley Center for Responsible, Decentralized Intelligence (RDI). This certification validates foundational knowledge of LLMs, the essential LLM abilities required for task automation, and the infrastructure needed for agent development.</p>

<p>The certificate can be downloaded <a href="https://jraymartinez.github.io/portfolio/assets/docs/jmartinez_ucberkeley_llmagents_2024.pdf">here</a>.</p>]]></content><author><name>John Ray Martinez</name></author><category term="certificates" /><category term="AWS" /><category term="certification" /><category term="machine learning" /><summary type="html"><![CDATA[Has successfully completed and earned the LLM Agents Mastery certification from UC Berkeley RDI.]]></summary></entry><entry><title type="html">AWS Certified Machine Learning - Specialty</title><link href="https://jraymartinez.github.io/portfolio/certificates/aws-ml/" rel="alternate" type="text/html" title="AWS Certified Machine Learning - Specialty" /><published>2023-05-31T08:50:00+00:00</published><updated>2023-05-31T08:50:00+00:00</updated><id>https://jraymartinez.github.io/portfolio/certificates/aws-ml</id><content type="html" xml:base="https://jraymartinez.github.io/portfolio/certificates/aws-ml/"><![CDATA[<p><a href="https://aws.amazon.com/certification/certified-machine-learning-specialty/"><em>AWS Training and Certification</em></a> (2023).<br /></p>

<p><strong>Description</strong>. John has successfully passed and earned the AWS Certified Machine Learning - Specialty certification from Amazon Web Services Training and Certification. This certification validates expertise in building, training, tuning, and deploying machine learning (ML) models on AWS.</p>

<p>The AWS Certified Machine Learning - Specialty exam has the following main content domains:</p>

<ol>
  <li>Data Engineering</li>
  <li>Exploratory Data Analysis</li>
  <li>Modeling</li>
  <li>Machine Learning Implementation and Operations</li>
</ol>

<p>The certificate can be downloaded <a href="https://jraymartinez.github.io/portfolio/assets/docs/jmartinez_aws_ml_certificate_2023.pdf">here</a>.</p>]]></content><author><name>John Ray Martinez</name></author><category term="certificates" /><category term="AWS" /><category term="certification" /><category term="machine learning" /><summary type="html"><![CDATA[Has successfully passed and earned the AWS Certified Machine Learning - Specialty certification from Amazon Web Services Training and Certification.]]></summary></entry><entry><title type="html">Outstanding Graduate Student Award Recipient</title><link href="https://jraymartinez.github.io/portfolio/certificates/outstanding-grad/" rel="alternate" type="text/html" title="Outstanding Graduate Student Award Recipient" /><published>2021-06-11T08:50:00+00:00</published><updated>2021-06-11T08:50:00+00:00</updated><id>https://jraymartinez.github.io/portfolio/certificates/outstanding-grad</id><content type="html" xml:base="https://jraymartinez.github.io/portfolio/certificates/outstanding-grad/"><![CDATA[<p><a href="https://drexel.edu/cci/news/"><em>Drexel University College of Computing &amp; Informatics (CCI)</em></a> (2021).<br /></p>

<p><strong>Description</strong>. John is a recipient of the Drexel University 2021 College of Computing &amp; Informatics (CCI) Outstanding Graduate Student Award. The CCI Awards are the most prestigious honor given within the College community, recognizing excellence, achievement, leadership, and innovation. As part of the award, John received a prize payment and was recognized during the CCI Honors event.</p>

<p>The certificate can be downloaded <a href="https://jraymartinez.github.io/portfolio/assets/docs/jmartinez_cci_2021_outstanding_grad_student.pdf">here</a>.</p>]]></content><author><name>John Ray Martinez</name></author><category term="certificates" /><category term="outstanding graduate student" /><category term="award" /><category term="prestigious honor" /><summary type="html"><![CDATA[Awarded 2021 Outstanding Graduate Student during CCI Honors event.]]></summary></entry><entry><title type="html">Multimodal Brain Tumor Segmentation using Convolutional Neural Network</title><link href="https://jraymartinez.github.io/portfolio/projects/bts/" rel="alternate" type="text/html" title="Multimodal Brain Tumor Segmentation using Convolutional Neural Network" /><published>2021-03-13T08:50:00+00:00</published><updated>2021-03-13T08:50:00+00:00</updated><id>https://jraymartinez.github.io/portfolio/projects/bts</id><content type="html" xml:base="https://jraymartinez.github.io/portfolio/projects/bts/"><![CDATA[<h2 id="authors">AUTHORS</h2>
<p><a href="https://jraymartinez.github.io/">John Ray Martinez</a> (jbm332@drexel.edu), <a href="https://www.linkedin.com/in/jonathan-musni-624773134/">Jonathan Musni</a> (jem472@drexel.edu), <a href="https://www.linkedin.com/in/marvin-joseph-occeno-8b4a95120/">Marvin Joseph Occeño</a> (mr048@drexel.edu), <a href="https://www.linkedin.com/in/edmarparreno/">Edmar Parreño</a> (erp75@drexel.edu), and <a href="https://www.linkedin.com/in/juanmigueltrinidad/">Juan Miguel Trinidad</a> (jbt46@drexel.edu)</p>

<p><sub> <em>This capstone project was selected for oral research presentation at the <a href="https://drexel.edu/graduatecollege/professional-development/emerging-graduate-scholars-conference/Archive/2021/2021-orals/">2021 Drexel Emerging Graduate Scholars Conference</a></em> </sub></p>

<p><img src="https://jraymartinez.github.io/portfolio/assets/images/bts.png" alt="Snapshots of the comparisons between ground truth, U-Net and PSPNet brain tumor segmentation (Fext = 0.35)." /></p>

<p><strong>Abstract.</strong> Segmentation is the process of examining brain images such as magnetic resonance imaging (MRI) images and computed tomography (CT) scans to locate regions of interest (ROIs). These regions define the boundaries of the brain tumor. This process allows radiologists or medical personnel to distinguish between healthy cells and tumors. However, manual segmentation requires considerable time and labor from radiologists to segment the images with high accuracy, and human error is inevitable. With these limitations in human capacity, manual segmentation can inhibit diagnosis and therefore delay treatment. Convolutional neural network (CNN) models are popular methods in image processing and have rapidly developed into a powerful tool in computer vision and pattern recognition. The U-Net architecture is a well-known variant of CNN that consists of a contracting (convolution) path and an expanding (deconvolution) path trained to produce an image-segmentation map. By utilizing the BraTS 2020 (Brain Tumor Segmentation 2020) dataset from the University of Pennsylvania Center for Biomedical Image Computing &amp; Analytics (CBICA), we are able to investigate the effectiveness of this specific architecture. We show that, by tuning the kernel size from 3 to 2 in the deconvolution path, the sensitivity for the Necrotic region improves significantly while performance in the Enhancing region degrades. In this study, we investigate Convolutional Neural Network (CNN)-based architectures such as U-Net and PSPNet. It is found that U-Net outperforms PSPNet in correctly segmenting the brain tumors. This study’s findings can inform the design and development of an automatic brain tumor segmentation system.</p>
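<p>As a rough illustration of the architecture described in the abstract, the sketch below builds a small U-Net-style model in Keras with a tunable deconvolution kernel size. The depth, filter counts, input shape, and number of classes are assumptions for illustration and are not the exact configuration used in the study.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Minimal U-Net-style sketch (illustrative assumptions, not the study's exact model):
# a contracting (convolution) path, an expanding (deconvolution) path, and a
# deconvolution kernel size that can be switched from 3 to 2.
from tensorflow.keras import layers, Model

def build_unet(input_shape=(128, 128, 4), deconv_kernel=2, n_classes=4):
    inputs = layers.Input(shape=input_shape)

    # Contracting (convolution) path
    c1 = layers.Conv2D(32, 3, activation="relu", padding="same")(inputs)
    p1 = layers.MaxPooling2D(2)(c1)
    c2 = layers.Conv2D(64, 3, activation="relu", padding="same")(p1)
    p2 = layers.MaxPooling2D(2)(c2)

    # Bottleneck
    b = layers.Conv2D(128, 3, activation="relu", padding="same")(p2)

    # Expanding (deconvolution) path; deconv_kernel is the tuned hyperparameter
    u2 = layers.Conv2DTranspose(64, deconv_kernel, strides=2, padding="same")(b)
    c3 = layers.Conv2D(64, 3, activation="relu", padding="same")(layers.concatenate([u2, c2]))
    u1 = layers.Conv2DTranspose(32, deconv_kernel, strides=2, padding="same")(c3)
    c4 = layers.Conv2D(32, 3, activation="relu", padding="same")(layers.concatenate([u1, c1]))

    # Per-pixel class probabilities (e.g., background and tumor sub-regions)
    outputs = layers.Conv2D(n_classes, 1, activation="softmax")(c4)
    return Model(inputs, outputs)

model = build_unet(deconv_kernel=2)
model.summary()
</code></pre></div></div>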

<p>Full paper can be requested.</p>]]></content><author><name>John Ray Martinez</name></author><category term="projects" /><category term="brain tumor" /><category term="Convolutional Neural Network" /><category term="image segmentation" /><summary type="html"><![CDATA[In this capstone project, we investigate Convolutional Neural Network (CNN)-based architectures such as U-Net and PSPNet in segmenting the brain tumors.]]></summary></entry><entry><title type="html">The Impact of driver distraction and secondary tasks with and without other co-occurring driving behaviors on the level of road traffic crashes</title><link href="https://jraymartinez.github.io/portfolio/publications/aap/" rel="alternate" type="text/html" title="The Impact of driver distraction and secondary tasks with and without other co-occurring driving behaviors on the level of road traffic crashes" /><published>2021-02-17T23:14:36+00:00</published><updated>2021-02-17T23:14:36+00:00</updated><id>https://jraymartinez.github.io/portfolio/publications/aap</id><content type="html" xml:base="https://jraymartinez.github.io/portfolio/publications/aap/"><![CDATA[<p><a href="https://www.linkedin.com/in/ali-jazayeri/">Ali Jazayeri</a>, <a href="https://jraymartinez.github.io/portfolio">John Ray Martinez</a>, <a href="https://www.linkedin.com/in/helen-loeb-81240013/">Helen Loeb</a>, and <a href="http://cci.drexel.edu/faculty/cyang/index.html">Christopher C. Yang</a><br />
<em>Accident Analysis &amp; Prevention</em> 153 (2021) 106010.<br />
<a href="https://www.sciencedirect.com/science/article/abs/pii/S0001457521000415">https://doi.org/10.1016/j.aap.2021.106010</a></p>

<p><img src="https://jraymartinez.github.io/portfolio/assets/images/aap.png" alt="Snapshots of the tearing process for a lattice with a random distribution of reinforcements (Fext = 0.35)." /></p>

<p><strong>Abstract.</strong> Driving safety is typically affected by concurrent non-driving tasks. These activities might negatively impact a trip’s outcome and cause near-crash or crash incidents and accidents. Crashes impose a tremendous social and economic cost on society and might affect the involved individuals’ quality of life. As it stands, road injuries are ranked among the top ten leading causes of death by the World Health Organization. Distracted driving is defined as a diversion of the driver’s attention toward a competing activity. Numerous studies have shown that distracted driving increases the probability of near-crash or crash events. By leveraging the statistical power of the large SHRP2 naturalistic data, we are able to quantify the preponderance of specific distractions during daily trips and confirm the causal contribution of a ubiquitous non-driving task to crash events. We show that, except for phone usage, which happens more frequently in near-crash and crash categories than in baseline trips, both distracted driving and secondary tasks occur almost uniformly across different types of trips. In this study, we investigate the impact of the co-occurrence of distracted driving with other driving behaviors and secondary tasks. It is found that the co-occurrence of distracted driving with other driving behaviors or secondary tasks increases the chance of near-crash and crash events. This study’s findings can inform the design and development of more precise and reliable driving assistance and warning systems.</p>

<p>Full paper can be downloaded <a href="https://jraymartinez.github.io/portfolio/assets/docs/jmartinez_aap_2021.pdf">here</a></p>]]></content><author><name>John Ray Martinez</name></author><category term="publications" /><category term="Distracted driving" /><category term="Secondary tasks" /><category term="Co-occurring behaviors" /><category term="Driving behaviors" /><summary type="html"><![CDATA[Driving safety is typically affected by concurrent non-driving tasks that might negatively impact the trips’ outcome and cause near-crash or crash accidents.]]></summary></entry><entry><title type="html">Predicting Multiple Time Series with USA COVID-19 data using Machine Learning models</title><link href="https://jraymartinez.github.io/portfolio/projects/covid/" rel="alternate" type="text/html" title="Predicting Multiple Time Series with USA COVID-19 data using Machine Learning models" /><published>2020-09-03T08:50:00+00:00</published><updated>2020-09-03T08:50:00+00:00</updated><id>https://jraymartinez.github.io/portfolio/projects/covid</id><content type="html" xml:base="https://jraymartinez.github.io/portfolio/projects/covid/"><![CDATA[<h2 id="author">AUTHOR</h2>
<p><a href="https://jraymartinez.github.io/">John Ray Martinez</a> (jbm332@drexel.edu)</p>

<p><sub> <em>This research was implemented in fulfillment of the requirements for the Applied Machine Learning course of the Master of Science in Data Science program at the Drexel University College of Computing &amp; Informatics</em> </sub></p>

<h2 id="introduction">Introduction</h2>

<p>As novel coronavirus COVID-19 cases surge across the US, improving methods for predicting COVID-19 cases in this country is extremely important. It is imperative that hotspots are studied more thoroughly to slow down the outbreak while a cure and vaccine are being sought. Forecasting the time of a future surge would minimize the impact of COVID-19 by enabling timely preventive steps, including early public health responses such as lockdowns, school closures, and travel restrictions.</p>

<p>Therefore, accurate COVID-19 transmission rate forecasting is essential to better understand the current situation and plan for the future. It also enables public health authorities to implement interventions effectively to control outbreaks, which would greatly minimize the social and economic impact of the disease.</p>

<p>The objective of this study is to determine whether a single Machine Learning model can be used on multiple time series (new daily confirmed COVID-19 cases for each state) to project future COVID-19 confirmed cases.</p>

<h2 id="data-description">DATA DESCRIPTION</h2>

<p>The dataset was obtained from the 2019 Novel Coronavirus COVID-19 (2019-nCoV) <a href="https://github.com/CSSEGISandData/COVID-19">Data Repository</a> by Johns Hopkins CSSE. In addition, 2019 US state population data (NST-EST2019-alldata) was obtained from the <a href="https://www.census.gov/data/tables/time-series/demo/popest/2010s-state-total.html">United States Census Bureau</a>.</p>

<p>The time series data is split into a training set (01/22/2020 - 06/30/2020), a validation set (07/01/2020 - 07/31/2020), and a test set (08/01/2020 - 08/22/2020), as sketched below.</p>
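<p>A minimal sketch of this date-based split with pandas is shown below; the file name and the columns <code>date</code>, <code>state</code>, and <code>new_cases</code> are assumptions for illustration.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Date-based split of the per-state daily series (a sketch; the file name and
# the columns 'date', 'state', and 'new_cases' are assumed for illustration).
import pandas as pd

df = pd.read_csv("us_daily_confirmed_by_state.csv", parse_dates=["date"])

train = df[df["date"].le("2020-06-30")]
valid = df[df["date"].between("2020-07-01", "2020-07-31")]
test = df[df["date"].between("2020-08-01", "2020-08-22")]
</code></pre></div></div>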

<h2 id="methodology">METHODOLOGY</h2>

<p>Preprocessing includes feature engineering, data merging, feature selection, log and polynomial transformation, and categorical encoding. The performance of the following machine learning models is compared.</p>

<ul>
  <li>Linear Regressor</li>
  <li>Random Forest Regressor</li>
  <li>Gradient Boosting Regressor</li>
  <li>XGBoost Regressor</li>
</ul>
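<p>A minimal sketch of the preprocessing step described above (log transformation of the target, polynomial expansion of numeric features, and one-hot encoding of the state) is shown below. The feature names are assumptions for illustration and build on the split sketch above.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Sketch of preprocessing: log transform of the target, polynomial features,
# and one-hot encoding of the state. The numeric feature names are assumed to
# have been engineered/merged beforehand (e.g., from the Census population data).
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import PolynomialFeatures, OneHotEncoder

numeric_features = ["population", "days_since_first_case"]  # assumed engineered columns
preprocessor = ColumnTransformer([
    ("poly", PolynomialFeatures(degree=2, include_bias=False), numeric_features),
    ("state", OneHotEncoder(handle_unknown="ignore"), ["state"]),
])

X_train = preprocessor.fit_transform(train.drop(columns=["new_cases"]))
y_train = np.log1p(train["new_cases"])  # log transform of the daily case counts
</code></pre></div></div>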

<p>The general workflow for comparing the performance of the machine learning models, shown in Figure 1, involves the following steps.</p>

<ol>
  <li>Preprocessing of data</li>
  <li>Recursive forecasting using a rolling window with 1-day-ahead predictions (a sketch follows this list)</li>
  <li>Fitting each model and evaluating it using R-squared, Root Mean Square Error (RMSE), and Mean Absolute Error (MAE)</li>
  <li>Comparing the models via R-squared, RMSE, and MAE, and plotting the test predictions and residuals of all models</li>
</ol>
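<p>A minimal sketch of steps 2-4 is shown below: lag features form the rolling window, each model predicts one day ahead recursively over the test horizon, and the models are compared via R-squared, RMSE, and MAE. The lag-feature construction and the synthetic placeholder series are assumptions for illustration, not the study's exact pipeline.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Sketch of recursive 1-day-ahead forecasting with a rolling window of lag
# features, plus model comparison via R-squared, RMSE, and MAE.
# The lag construction and the synthetic series are illustrative assumptions.
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
from xgboost import XGBRegressor

N_LAGS = 7  # rolling window: the previous 7 days

def make_lag_features(series):
    """Build a supervised frame whose columns are the previous N_LAGS values."""
    frame = pd.concat([series.shift(i) for i in range(1, N_LAGS + 1)], axis=1)
    frame.columns = [f"lag_{i}" for i in range(1, N_LAGS + 1)]
    frame = frame.dropna()
    return frame, series.loc[frame.index]

def recursive_forecast(model, history, horizon):
    """Predict one day ahead repeatedly, feeding each prediction back in."""
    window = list(history)
    preds = []
    for _ in range(horizon):
        x = np.array(window[-N_LAGS:][::-1]).reshape(1, -1)  # lag_1 (yesterday) first
        y_hat = float(model.predict(x)[0])
        preds.append(y_hat)
        window.append(y_hat)
    return np.array(preds)

# Synthetic placeholder for one state's daily new cases (the study uses the JHU data).
dates = pd.date_range("2020-01-22", "2020-08-22", freq="D")
series = pd.Series(np.random.default_rng(0).poisson(100, len(dates)).astype(float), index=dates)
train_series, test_series = series[:"2020-07-31"], series["2020-08-01":]

models = {
    "Linear": LinearRegression(),
    "Random Forest": RandomForestRegressor(random_state=0),
    "Gradient Boosting": GradientBoostingRegressor(random_state=0),
    "XGBoost": XGBRegressor(random_state=0),
}

X_train, y_train = make_lag_features(train_series)
for name, model in models.items():
    model.fit(X_train.values, y_train.values)
    preds = recursive_forecast(model, train_series.values, len(test_series))
    rmse = float(np.sqrt(mean_squared_error(test_series, preds)))
    mae = mean_absolute_error(test_series, preds)
    print(name, round(r2_score(test_series, preds), 3), round(rmse, 1), round(mae, 1))
</code></pre></div></div>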

<p><img src="https://jraymartinez.github.io/portfolio/assets/images/covid_fig1.png" alt="Methodology" />
Figure 1. Workflow for measuring the performance of Machine Learning models.</p>

<h2 id="results">RESULTS</h2>
<p>The metric evaluations show that XGBoost had the best results in predicting the test dataset, while Gradient Boosting outperformed all of the models on the training dataset. XGBoost also demonstrated its parallel-processing capability by clocking the fastest elapsed time.</p>

<p><img src="https://jraymartinez.github.io/portfolio/assets/images/covid_metric_results.png" alt="Results" />
Table 1. Statistics for prediction of COVID-19 daily confirmed cases.</p>

<p>The test predictions and residuals of all models for the states with the highest numbers of confirmed cases are plotted below.</p>

<h3 id="a-texas">A. <strong>Texas</strong></h3>

<p><img src="https://jraymartinez.github.io/portfolio/assets/images/covid_texas_pred.png" alt="'Texas prediction'" /> 
Figure 2. Texas prediction plot over the test set.</p>

<p><img src="https://jraymartinez.github.io/portfolio/assets/images/covid_texas_res.png" alt="'Texas redidual'" />
Figure 3. Texas residual plot over the test set.</p>

<h3 id="b-florida">B. <strong>Florida</strong></h3>

<p><img src="https://jraymartinez.github.io/portfolio/assets/images/covid_florida_pred.png" alt="'Florida prediction'" />
Figure 4. Florida prediction plot over the test set.</p>

<p><img src="https://jraymartinez.github.io/portfolio/assets/images/covid_florida_res.png" alt="'Florida redidual'" />
Figure 5. Florida residual plot over the test set.</p>

<h3 id="c-california">C. <strong>California</strong></h3>

<p><img src="https://jraymartinez.github.io/portfolio/assets/images/covid_calif_pred.png" alt="'California prediction'" />
Figure 6. California prediction plot over the test set.</p>

<p><img src="https://jraymartinez.github.io/portfolio/assets/images/covid_calif_res.png" alt="'California redidual'" />
Figure 7. California rsidual plot over the test set.</p>

<p>Given the ongoing coronavirus situation, the researcher hopes that this project will be relevant and helpful to other researchers, specialists, and especially public health surveillance systems, which would benefit from knowing when to focus on this disease and when a surge or resurgence is likely to happen. Furthermore, it would help public health authorities and governments alike in deciding whether to ease a lockdown or issue another lockdown for each state.</p>

<h2 id="limitations">LIMITATIONS</h2>
<p>First, the method does not account for the different control measures in each state, such as the level of social distancing, or how these will change in the future. In addition, the available period of time is about half a year across roughly 50 US states, which results in approximately 11,000 data points: a relatively small dataset. Moreover, despite focusing on the US, the study does not map the virus’s trend to specific areas or regions within the country.</p>

<h2 id="insights-and-conclusion">INSIGHTS AND CONCLUSION</h2>

<p>The estimated values of COVID-19 daily confirmed cases were in good agreement with the corresponding observed values, and the Machine Learning models used, especially the ensemble Boosting models, could be used to forecast daily confirmed cases. These results are worthwhile for decision-making bodies and public health experts when decisions are urgent.</p>

<p>For future work, merging more domain-related data like temperature, lockdown periods, etc, that have significant impacts in variation of number of COVID19 cases can be considered. In addition, the direct forecasting approach discussed by Souhaib Ben Taieb in his <a href="https://souhaib-bentaieb.com/papers/2014_phd.pdf">dissertation paper</a> can be explored.</p>]]></content><author><name>John Ray Martinez</name></author><category term="projects" /><category term="multiple time series" /><category term="linear regressor" /><category term="random forest" /><category term="gradient boosting" /><category term="XGBoost" /><summary type="html"><![CDATA[As novel coronavirus COVID-19 cases surge across the US, improving methods for prediction of COVID-19 cases in this country is extremely important.]]></summary></entry><entry><title type="html">I AM SAM: An Automatic Text Summarization System using different Extractive Techniques</title><link href="https://jraymartinez.github.io/portfolio/projects/iamsam/" rel="alternate" type="text/html" title="I AM SAM: An Automatic Text Summarization System using different Extractive Techniques" /><published>2020-08-27T08:50:00+00:00</published><updated>2020-08-27T08:50:00+00:00</updated><id>https://jraymartinez.github.io/portfolio/projects/iamsam</id><content type="html" xml:base="https://jraymartinez.github.io/portfolio/projects/iamsam/"><![CDATA[<h2 id="authors">AUTHORS</h2>
<p><a href="https://jraymartinez.github.io/">John Ray Martinez</a> (jbm332@drexel.edu), <a href="https://www.linkedin.com/in/jonathan-musni-624773134/">Jonathan Musni</a> (jem472@drexel.edu), <a href="https://www.linkedin.com/in/miggytrinidad/">Juan Miguel Trinidad</a> (jbt46@drexel.edu)</p>

<p><sub> <em>This research was implemented in fulfillment of the requirements for the Information Retrieval Systems course of the Master of Science in Data Science program at the Drexel University College of Computing &amp; Informatics</em> </sub></p>

<h2 id="introduction">INTRODUCTION</h2>
<p>In recent years, there has been a growth in the volume of text data from a variety of sources. This explosion of text data has led to the problem of information overload. Today’s generation, called the ‘Net generation’, learns through multitasking, performs activities simultaneously, and has a short attention span. The ‘Net generation’ can perform more tasks simultaneously and shift their attention quickly from one task to another, but would probably be overwhelmed if asked to read a long report. Thus, more educators motivate them to engage with the learning content by supplying shorter content in the curricula [<a href="#ref1">1</a>]. To alleviate information overload, and considering this characteristic of the ‘Net generation’, automatic text summarization is deemed necessary.</p>

<p>One tool for text summarization is the Python package sumy [<a href="#ref2">2</a>]. Its three most notably used models are LSA (latent semantic analysis), LexRank, and Luhn. LSA is an unsupervised method of summarization that combines term frequency techniques with singular value decomposition to summarize texts. Also an unsupervised approach, LexRank is a graph-based text summarizer inspired by the PageRank algorithm. Meanwhile, Luhn is a naive approach based on TF-IDF. It scores sentences based on the frequency of the most important words and also assigns higher weights to sentences occurring near the beginning of a document [<a href="#ref3">3</a>]. In this study, we investigate and evaluate the application of the sumy models to the extractive summarization task using news articles and show that the results obtained with LSA are competitive with those of the other two algorithms.</p>

<p>Furthermore, utilizing the sumy extractive summarization techniques, we build and deploy a web application on Heroku that mainly functions as a text summarizer.</p>

<h2 id="experiments">EXPERIMENTS</h2>

<h3 id="data-description">DATA DESCRIPTION</h3>
<p>The dataset comprises approximately 2225 documents from the BBC news website, covering five topical areas: business, entertainment, politics, sport, and technology [<a href="#ref4">4</a>]. The portion used for extractive text summarization contains 510 BBC business news articles from 2004 to 2005. For each article, one summary is provided in the Summaries folder. In this study, the first 100 pairs of business news articles and their corresponding reference summaries were manually selected and used. These extractive summaries serve as the reference summaries (gold standard) for evaluating the system summaries using ROUGE.</p>

<h3 id="methodology">METHODOLOGY</h3>
<p><img src="https://jraymartinez.github.io/portfolio/assets/images/iamsam_algo.png" alt="'Algorithm 1'" style="float: left;margin-right: 7px;margin-top: 7px;" /></p>

<p>We applied the three sumy methods to the sampled business news articles. Each algorithm extracts six sentences from each article to compose the summary. We performed an experimental comparison of the three extractive summarization techniques. The performance of each summarization technique was evaluated using variants of the ROUGE measure [<a href="#ref5">5</a>]. This performance metric is based on N-gram statistics and has been found to be highly correlated with human evaluations [<a href="#ref6">6</a>]. Concretely, we use ROUGE-N with unigrams and bigrams (ROUGE-1 and ROUGE-2) as well as ROUGE-L. Each ROUGE variant has corresponding F1, precision, and recall scores. First, the value of the evaluation measure was calculated for each article. Next, we averaged those scores to arrive at consolidated Recall and F1 scores for each ROUGE variant. Algorithm 1 shows the pseudo-code of the method implemented in this study.</p>
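<p>A minimal sketch of this loop is shown below, assuming the <code>sumy</code> and <code>rouge-score</code> packages; the original implementation may have used a different ROUGE package, and the variable names are illustrative.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Sketch of the evaluation loop: summarize each article with the three sumy
# models and average ROUGE recall against the BBC reference summaries.
# Assumes the `sumy` and `rouge-score` packages (the study's exact ROUGE
# implementation may differ).
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lsa import LsaSummarizer
from sumy.summarizers.lex_rank import LexRankSummarizer
from sumy.summarizers.luhn import LuhnSummarizer
from rouge_score import rouge_scorer

SENTENCES = 6
summarizers = {"LSA": LsaSummarizer(), "LexRank": LexRankSummarizer(), "Luhn": LuhnSummarizer()}
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)

def summarize(text, summarizer):
    parser = PlaintextParser.from_string(text, Tokenizer("english"))
    return " ".join(str(s) for s in summarizer(parser.document, SENTENCES))

def average_recall(articles, references, summarizer):
    """Average ROUGE-1/2/L recall of one summarizer over article/reference pairs."""
    totals = {"rouge1": 0.0, "rouge2": 0.0, "rougeL": 0.0}
    for article, reference in zip(articles, references):
        scores = scorer.score(reference, summarize(article, summarizer))
        for key in totals:
            totals[key] += scores[key].recall
    return {key: value / len(articles) for key, value in totals.items()}

# `articles` and `references` are assumed lists of the 100 BBC business article
# texts and their gold-standard summaries.
# for name, s in summarizers.items():
#     print(name, average_recall(articles, references, s))
</code></pre></div></div>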

<h2 id="experimental-results-and-discussion">EXPERIMENTAL RESULTS AND DISCUSSION</h2>
<p>We evaluate the three summarization techniques on a single-document summarization task using 100 news articles from the business section of the BBC dataset. For this task to have a meaningful evaluation, we report ROUGE Recall as the standard evaluation and take output length into account [<a href="#ref7">7</a>]. For each article, each summarizer generates a six-sentence summary. The corresponding 100 human-created reference summaries are provided by BBC and used in the evaluation process. We compare the performance of the three different summarization techniques with each other.</p>

<p>Table 1 shows the results obtained on this dataset of 100 news articles for LSA and for the other two sumy summarizers in the single-document summarization task. The LSA summarization technique performs best on the news articles, followed by sumy-LexRank and then sumy-Luhn.</p>

<p><a id="table1"></a></p>
<h4 id="table-1-the-average-recall-f1-of-test-set-results-on-the-bbc-business-news-articles-dataset-using-granularity-of-text-metrics-rouge-1-rouge-2-and-rouge-l">Table 1. The average Recall (F1) of test set results on the BBC business news articles dataset using granularity of text metrics ROUGE-1, ROUGE-2 and ROUGE-L.</h4>
<table>
<thead>
<tr>
<th>Model</th>
<th>ROUGE-1</th>
<th>ROUGE-2</th>
<th>ROUGE-L</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>LSA</code></td>
<td>0.867 (0.059)</td>
<td>0.617 (0.041)</td>
<td>0.841 (0.082)</td>
</tr>
<tr>
<td><code>Luhn</code></td>
<td>0.794 (0.052)</td>
<td>0.407 (0.031)</td>
<td>0.612 (0.045)</td>
</tr>
<tr>
<td><code>LexRank</code></td>
<td>0.844 (0.072)</td>
<td>0.576 (0.049)</td>
<td>0.807 (0.096)</td>
</tr>
</tbody>
</table>

<p>Figure 1 visualizes the comparison of the models using ROUGE Recall as the performance metric. As shown, LSA has the best performance in the extractive summarization task on business news articles.</p>

<p><img src="https://jraymartinez.github.io/portfolio/assets/images/iamsam_fig1.PNG" alt="'ROUGE'" />
Figure 1: ROUGE performance of algorithms.</p>

<h2 id="implemented-system">IMPLEMENTED SYSTEM</h2>
<p>In this section, we present the overall architecture of the implemented system and discuss its major features.</p>

<h3 id="system-architecture">SYSTEM ARCHITECTURE</h3>
<p>The overall architecture of the web application for single-document summarization of news articles using the sumy models LSA, Luhn, and LexRank is shown in Figure 2. The three main phases are the back-end, the front-end, and deployment. To create the web application, we utilized Flask, a micro web framework written in Python. To make the layout look good, we styled it with Bootstrap. Finally, we deployed the models on Heroku. Figure 3 presents a detailed diagram of the system features.</p>

<p><img src="https://jraymartinez.github.io/portfolio/assets/images/iamsam_fig2.PNG" alt="'Architecture overview'" />
Figure 2: System architecture overview.</p>
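<p>A minimal sketch of the Flask back-end is shown below. The route name, form fields, and target-length handling are illustrative assumptions, not the deployed app’s actual code.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Minimal Flask back-end sketch (route name and form fields are assumptions).
from flask import Flask, request

from sumy.parsers.plaintext import PlaintextParser
from sumy.parsers.html import HtmlParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lsa import LsaSummarizer

app = Flask(__name__)
TOKENIZER = Tokenizer("english")

@app.route("/summarize", methods=["POST"])
def summarize():
    # Inputs: plain text or a URL, plus the target length in sentences
    text = request.form.get("text", "")
    url = request.form.get("url", "")
    length = int(request.form.get("length", 6))

    if url:
        parser = HtmlParser.from_url(url, TOKENIZER)            # URL input
    else:
        parser = PlaintextParser.from_string(text, TOKENIZER)   # plain-text input

    sentences = LsaSummarizer()(parser.document, length)
    return " ".join(str(s) for s in sentences)

if __name__ == "__main__":
    app.run()  # on Heroku this would instead be launched via a Procfile (e.g., gunicorn)
</code></pre></div></div>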

<h3 id="system-features">SYSTEM FEATURES</h3>
<p>The I AM SAM web app alleviates information overload by distilling important information using machine learning algorithms. It is an assistant that helps users manage their time by providing a text summary in seconds. In this project, we focus on plain text and URLs as inputs. More specifically, we consider the following features:</p>

<p><img src="https://jraymartinez.github.io/portfolio/assets/images/iamsam_fig3.png" alt="'System features'" />
Figure 3: System Features.</p>

<h4 id="target-length">Target Length</h4>
<p>Target length is the number of sentences in the text summary. This feature lets the user specify the preferred length of the summary in terms of the number of sentences in the output.</p>

<h4 id="inputs">Inputs</h4>
<p>There are two possible inputs: a plain text article or a URL that contains the article. Figure 3 visualizes the flow diagram for these two options.</p>

<h4 id="check-mode">Check Mode</h4>
<p>The ‘check mode’ feature gives the user the option to supply a reference summary. There is an ON and OFF toggle for this feature. It allows the user to check how good the generated summaries are with respect to the reference summary. It also shows the calculated ROUGE (F1, precision, recall) scores for each model’s summary.</p>

<h4 id="best-summary-generator">Best Summary Generator</h4>
<p>Once check mode is ON, the system compares the summary output of each algorithm and reports the best model based on the ROUGE Recall metric.</p>
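<p>A small sketch of this selection step is shown below; it reuses the <code>summarizers</code>, <code>summarize</code>, and <code>scorer</code> objects from the methodology sketch and picks the model with the highest ROUGE-1 recall (the exact ROUGE variant and tie-breaking used in the app are assumptions).</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Sketch of the best-summary selection: score each model's summary against the
# user's reference summary and return the model with the highest ROUGE-1 recall.
# Reuses `summarizers`, `summarize`, and `scorer` from the methodology sketch.
def best_summary(text, reference):
    results = {}
    for name, summarizer in summarizers.items():
        candidate = summarize(text, summarizer)
        recall = scorer.score(reference, candidate)["rouge1"].recall
        results[name] = (recall, candidate)
    best_model = max(results, key=lambda name: results[name][0])
    return best_model, results[best_model][1]
</code></pre></div></div>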

<h2 id="conclusion-and-future-works">CONCLUSION AND FUTURE WORKS</h2>
<p>We showed that the Python package sumy’s implementation of LSA (latent semantic analysis) outperforms the other models, LexRank and Luhn, in the extractive summarization task on BBC business news articles. Furthermore, we implemented an automatic text summarization system called I AM SAM, deployed on Heroku, that can summarize a news article from a URL or plain text using the three sumy extractive summarization techniques.</p>

<p>As future work, we plan to extend the averaging evaluation to all 510 articles in the business news folder. In addition, we will explore the other news categories such as entertainment, politics, sport, and technology.</p>

<h2 id="references">REFERENCES</h2>
<p>[1] <a id="ref1"></a>D. G. Oblinger and J. L. Oblinger. 2005. In Educating the
net generation. Educause. Retrieved August, 19, 2020 from
<a href="https://www.educause.edu/ir/library/PDF/pub7101.PDF">https://www.educause.edu/ir/library/PDF/pub7101.PDF</a>.</p>

<p>[2] <a id="ref2"></a>Mišo Belica. 2020. Module for automatic summarization of text
documents and HTML pages. <a href="https://github.com/miso-belica/sumy">https://github.com/miso-belica/sumy</a>.</p>

<p>[3] <a id="ref3"></a>Mišo Belica. 2020. Summarization methods. <a href="https://github.com/miso-belica/sumy/blob/master/docs/summarizators.md">https://github.com/miso-belica/sumy/blob/master/docs/summarizators.md</a>.</p>

<p>[4] <a id="ref4"></a>Derek Greene and Pádraig Cunningham. 2006. Practical Solutions to the Problem of Diagonal Dominance in Kernel Document
Clustering. In Proc. 23rd International Conference on Machine
learning (ICML’06). ACM Press, 377–384.</p>

<p>[5] <a id="ref5"></a>C.Y. Lin. 2004. ROUGE: A Package for Automatic Evaluation of
Summaries. In In Text Summarization Branches Out: Proceedings of the ACL-04 Workshop. Association for Computational
Linguistics: Barcelona, Spain, 74–81.</p>

<p>[6] <a id="ref6"></a>C.Y Lin and E.H. Hovy. 2003. Automatic evaluation of summaries using n-gram co-occurrence statistics. In In Proceedings of
Human Language Technology Conference (HLT-NAACL 2003).
Association for Computational Linguistics: Edmonton, Canada.</p>

<p>[7] <a id="ref7"></a> Benjamin Van Durme Courtney Napoles and ChrisCallison-Burch. 2011. Evaluating sentence com-pression: Pitfalls and suggested
remedies. In Proceedings of the Workshop on Monolingual Text-To-Text Generation. Association for Computational Linguistics: Portland, Oregon, 91–97.</p>]]></content><author><name>John Ray Martinez</name></author><category term="projects" /><category term="automatic text summarization" /><category term="extract" /><category term="inormation retrieval" /><category term="natural language processing" /><category term="SUMY" /><summary type="html"><![CDATA[We implemented an automatic text summarization system that has capability to summarize news article from a plan text or URL.]]></summary></entry><entry><title type="html">Identifying Co-occurrence Based on Hours Played for Video Games</title><link href="https://jraymartinez.github.io/portfolio/projects/videogames/" rel="alternate" type="text/html" title="Identifying Co-occurrence Based on Hours Played for Video Games" /><published>2020-06-12T08:50:00+00:00</published><updated>2020-06-12T08:50:00+00:00</updated><id>https://jraymartinez.github.io/portfolio/projects/videogames</id><content type="html" xml:base="https://jraymartinez.github.io/portfolio/projects/videogames/"><![CDATA[<h2 id="authors">AUTHORS</h2>
<p><a href="https://jraymartinez.github.io/">John Ray Martinez</a> (jbm332@drexel.edu), <a href="https://www.linkedin.com/in/jonathan-musni-624773134/">Jonathan Musni</a> (jem472@drexel.edu), <a href="https://www.linkedin.com/in/marvin-joseph-occeno-8b4a95120/">Marvin Joseph Occeno</a> (mr048@drexel.edu)</p>

<p><sub> <em>This research was implemented in fulfillment of the requirements for the Data Mining course of the Master of Science in Data Science program at the Drexel University College of Computing &amp; Informatics</em> </sub></p>

<h2 id="introduction">INTRODUCTION</h2>
<p>Playing video games has always been a popular leisure activity. Recently, in light of the pandemic, people have actually been encouraged to play video games to ensure that they stay at home [<a href="#ref1">1</a>]. To keep gamers playing, recommendation engines are utilized by several online video game stores. Players receive various game suggestions which are usually based on, but not limited to, their gaming history [<a href="#ref2">2</a>]. In this project, we create a game-based recommender system using association rules mining with respect to the video games that were frequently played together. The objectives of this study are to: i) identify the most played video games; ii) identify the frequently co-occurring video games; and iii) provide recommendations based on correlated video games.</p>

<h2 id="data-description">DATA DESCRIPTION</h2>
<p>Steam, the largest digital distribution platform for PC gaming, has 6000 games and a community of millions of gamers. One study shows that searchability is one of the reasons why Steam is growing so rapidly [<a href="#ref3">3</a>]. Moreover, it experienced explosive growth in 2018. The platform has attracted a lot of companies that source data from it. Tamber, an analytics service company, manually crawled the data from the Steam API in 2017.</p>

<p>As per the Kaggle documentation [<a href="#ref4">4</a>], the dataset, which is approximately nine megabytes, contains the following columns:</p>

<p><a id="table1"></a></p>
<h4 id="table-1-sample-filtered-dataset">Table 1. Sample Filtered Dataset.</h4>
<table>
<thead>
<tr>
<th>User Id</th>
<th>Games Played</th>
<th>Number of Hours Played</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>298950</code></td>
<td>ARK Survival Evolved</td>
<td>41.0</td>
</tr>
<tr>
<td><code>76767</code></td>
<td>Call of Duty Modern Warfare 2</td>
<td>65.0</td>
</tr>
<tr>
<td><code>76767</code></td>
<td>Banished</td>
<td>24.0</td>
</tr>
<tr>
<td><code>229911</code></td>
<td>Call of Duty Modern Warfare 2</td>
<td>44.0</td>
</tr>
<tr>
<td><code>86540</code></td>
<td>Audiosurf</td>
<td>57.0</td>
</tr>
</tbody>
</table>

<h2 id="methodology">METHODOLOGY</h2>
<p>We transformed the dataset into a matrix of 1s and 0s. The columns of the new dataframe represent the video games, whereas the rows represent the players. A table cell is set to 1 if a user has played the game for at least the median number of hours played; otherwise, its value is 0, as shown in Table 2. We utilized the Python library MLxtend to apply the Apriori algorithm and determine the frequent itemsets. The library also generated association rules from these itemsets, listing pattern evaluation metrics such as support, confidence, and lift. Based on this discretization, we generated association rules and built a recommender system, as sketched below.</p>
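<p>A minimal sketch of this pipeline with pandas and MLxtend is shown below. The file name and the columns <code>user_id</code>, <code>game</code>, and <code>hours</code> are assumptions for illustration, and the per-game median threshold is one reading of the description above (a single global median is an alternative).</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Sketch of the methodology: pivot plays into a user-by-game matrix, binarize
# against the median hours played, then mine itemsets and rules with MLxtend.
# Column names ('user_id', 'game', 'hours') and the per-game median threshold
# are assumptions for illustration.
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

plays = pd.read_csv("steam_play_hours.csv", names=["user_id", "game", "hours"])

# User-by-game matrix of hours played (0 where a game was never played)
hours = plays.pivot_table(index="user_id", columns="game", values="hours",
                          aggfunc="sum", fill_value=0)

# True if the user played the game for at least that game's median hours
median_hours = plays.groupby("game")["hours"].median()
onehot = hours.ge(median_hours, axis=1)

frequent = apriori(onehot, min_support=0.005, use_colnames=True)
rules = association_rules(frequent, metric="lift", min_threshold=1.0)
print(rules.sort_values("lift", ascending=False).head())
</code></pre></div></div>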

<p><a id="table2"></a></p>
<h4 id="table-2-sample-transformed-dataset">Table 2. Sample Transformed Dataset.</h4>
<table>
<thead>
<tr>
<th>User Id</th>
<th>ARK Survival Evolved</th>
<th>Audiosurf</th>
<th>Banished</th>
<th>BioShock Infinite</th>
<th>Borderlands 2</th>
<th>Call of Duty Black Ops</th>
<th>Call of Duty Modern Warfare 2</th>
<th>Call of Duty Modern Warfare 2 - Multiplayer</th>
<th>Call of Duty World at War</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>5250</code></td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
</tr>
<tr>
<td><code>76767</code></td>
<td>0.0</td>
<td>0.0</td>
<td>1.0</td>
<td>0.0</td>
<td>0.0</td>
<td>1.0</td>
<td>1.0</td>
<td>1.0</td>
<td>1.0</td>
</tr>
<tr>
<td><code>86540</code></td>
<td>0.0</td>
<td>1.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
</tr>
<tr>
<td><code>229911</code></td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>1.0</td>
<td>1.0</td>
<td>0.0</td>
</tr>
<tr>
<td><code>298950</code></td>
<td>1.0</td>
<td>0.0</td>
<td>0.0</td>
<td>1.0</td>
<td>1.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
</tr>
</tbody>
</table>

<h3 id="quantitative-association-rules">Quantitative Association Rules</h3>
<p>Here comes the fun part - finding association rules. The first step is to determine frequent itemsets. Since the data is relatively large, we have decided to set the minimum support to 0.005.</p>

<p><a id="table3"></a></p>
<h4 id="table-3-itemsets-with-minimum-support-of-0005">Table 3. Itemsets with minimum support of 0.005.</h4>
<table>
<thead>
<tr>
<th>support</th>
<th>itemsets</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>0.006799</code></td>
<td>(7 Days to Die)</td>
</tr>
<tr>
<td><code>0.009713</code></td>
<td>(APB Reloaded)</td>
</tr>
<tr>
<td><code>0.010962</code></td>
<td>(ARK Survival Evolved)</td>
</tr>
<tr>
<td><code>0.009852</code></td>
<td>(AdVenture Capitalist)</td>
</tr>
<tr>
<td><code>0.014153</code></td>
<td>(Age of Empires II HD Edition)</td>
</tr>
<tr>
<td><code>     ...</code></td>
<td>     ...</td>
</tr>
<tr>
<td><code>0.006660</code></td>
<td>(Left 4 Dead 2, The Elder Scrolls V Skyrim, Te...</td>
</tr>
<tr>
<td><code>0.006244</code></td>
<td>(Unturned, Left 4 Dead 2, Team Fortress 2)</td>
</tr>
<tr>
<td><code>0.007354</code></td>
<td>(Unturned, Robocraft, Team Fortress 2)</td>
</tr>
<tr>
<td><code>0.005828</code></td>
<td>(Unturned, Terraria, Team Fortress 2)</td>
</tr>
<tr>
<td><code>0.006244</code></td>
<td>(Team Fortress 2, Unturned, Garry's Mod, Count...</td>
</tr>
</tbody>
</table>

<p>Based on the 454 itemsets that passed the minimum support (0.005), we determined the most frequent k-itemsets below.</p>

<p><a id="table4"></a></p>
<h4 id="table-4-top-5-frequent-1-itemsets">Table 4. Top 5 Frequent 1-itemsets.</h4>
<table>
<thead>
<tr>
<th>Itemset</th>
<th>Support</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>Dota 2</code></td>
<td>0.3368</td>
</tr>
<tr>
<td><code>Team Fortress 2</code></td>
<td>0.1618</td>
</tr>
<tr>
<td><code>Counter-Strike Global Offensive</code></td>
<td>0.0956</td>
</tr>
<tr>
<td><code>Unturned</code></td>
<td>0.0744</td>
</tr>
<tr>
<td><code>Left 4 Dead 2</code></td>
<td>0.0561</td>
</tr>
</tbody>
</table>

<p><a id="table5"></a></p>
<h4 id="table-5-top-5-frequent-2-itemsets">Table 5. Top 5 Frequent 2-itemsets.</h4>
<table>
<thead>
<tr>
<th>Itemset</th>
<th>Support</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>Dota 2, Team Fortress 2</code></td>
<td>0.0336</td>
</tr>
<tr>
<td><code>Counter-Strike Global Offensive, Dota 2</code></td>
<td>0.0319</td>
</tr>
<tr>
<td><code>Counter-Strike Global Offensive, Team Fortress 2</code></td>
<td>0.0309</td>
</tr>
<tr>
<td><code>Team Fortress 2, Unturned</code></td>
<td>0.0279</td>
</tr>
<tr>
<td><code>Left 4 Dead 2, Team Fortress 2</code></td>
<td>0.0272</td>
</tr>
</tbody>
</table>

<p><a id="table6"></a></p>
<h4 id="table-6-top-5-frequent-3-itemsets">Table 6. Top 5 Frequent 3-itemsets.</h4>
<table>
<thead>
<tr>
<th>Itemset</th>
<th>Support</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>Counter-Strike Global Offensive, Dota 2, Team Fortress 2</code></td>
<td>0.0132</td>
</tr>
<tr>
<td><code>Counter-Strike Global Offensive, Garry's Mod, Team Fortress 2</code></td>
<td>0.0115</td>
</tr>
<tr>
<td><code>Counter-Strike Global Offensive, Team Fortress 2, Unturned</code></td>
<td>0.0115</td>
</tr>
<tr>
<td><code>Garry's Mod, Team Fortress 2, Unturned</code></td>
<td>0.0115</td>
</tr>
<tr>
<td><code>Counter-Strike Global Offensive, Left 4 Dead 2, Team Fortress 2 	</code></td>
<td>0.0115</td>
</tr>
</tbody>
</table>

<p>The most frequent 1-itemset is Dota 2, as shown in Table 4. It dominates the Steam gaming world. The results for the frequent 2-itemsets and 3-itemsets in Table 5 and Table 6, respectively, are not necessarily interesting, since it is quite expected that popular games such as Dota 2 and Team Fortress 2 would co-occur more often than others. Hence, we did some research on relative co-occurrence analysis and found a metric called all-confidence [<a href="#ref5">5</a>], which is given by Equation 1 below.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>               all-confidence(X⇒Y) =  support(X⇒Y) / max(support(X), support(Y))       (1)
</code></pre></div></div>

<p>If the all-confidence is equal to 1, then itemsets X and Y always co-occur relatively. This is equivalent to saying that both confidence(X⇒Y) and confidence(Y⇒X) are equal to 1.</p>

<p>Since MLxtend does not compute the all-confidence metric, we implemented a function (sketched below) and found the frequent 2-itemsets shown in Table 7 in the next section.</p>
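<p>A sketch of how all-confidence can be computed for the frequent 2-itemsets from the apriori output is shown below; the function name and return format are ours, not part of MLxtend’s API.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Sketch of an all-confidence computation for frequent 2-itemsets (Equation 1),
# using the `frequent` DataFrame produced by the apriori call sketched earlier.
# The function name and output format are ours, not part of MLxtend.
import pandas as pd

def all_confidence(frequent):
    """support(X and Y) / max(support(X), support(Y)) for every frequent 2-itemset."""
    support = {tuple(sorted(items)): s
               for items, s in zip(frequent["itemsets"], frequent["support"])}
    rows = []
    for items, joint in support.items():
        if len(items) != 2:
            continue
        x, y = items
        denom = max(support[(x,)], support[(y,)])
        rows.append({"itemset": items, "all_confidence": joint / denom})
    return (pd.DataFrame(rows)
              .sort_values("all_confidence", ascending=False)
              .reset_index(drop=True))

top_pairs = all_confidence(frequent)
print(top_pairs.head())
</code></pre></div></div>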

<h2 id="results">RESULTS</h2>

<p>As observed from Table 7, the popular game Dota 2 is nowhere to be found in the Top 5. This is because the frequency of 2-itemsets has been computed on a relative basis (all-confidence).</p>

<p><a id="table7"></a></p>
<h4 id="table-7-top-5-frequent-2-itemsets-based-on-all-confidence">Table 7. Top 5 Frequent 2-itemsets (Based on All-Confidence).</h4>
<table>
<thead>
<tr>
<th>Itemset</th>
<th>All-Confidence</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>Half-Life 2 Episode One, Half-Life 2 Episode Two</code></td>
<td>0.6410</td>
</tr>
<tr>
<td><code>Call of Duty Modern Warfare 3, Call of Duty Modern Warfare 3 - Multiplayer</code></td>
<td>0.5755</td>
</tr>
<tr>
<td><code>Call of Duty Modern Warfare 2, Call of Duty Modern Warfare 2 - Multiplayer 	</code></td>
<td>0.5436</td>
</tr>
<tr>
<td><code>Call of Duty Black Ops, Call of Duty Black Ops - Multiplayer</code></td>
<td>0.4912</td>
</tr>
<tr>
<td><code>Total War ROME II - Emperor Edition, Total War SHOGUN 2</code></td>
<td>0.4444</td>
</tr>
</tbody>
</table>

<p>Speaking of all-confidence, this metric seems to be reliable enough for co-occurrence analysis since the results above are quite sensible. For instance, Half-Life 2 Episode One and Half-Life 2 Episode Two are shown to co-occur frequently despite not being popular, as shown in Table 7. Looking at their titles, one is probably a sequel of the other. This means that players are highly interested in completing the game series, since the co-occurrence is relatively high. In addition, Call of Duty Modern Warfare 3 and Call of Duty Modern Warfare 3 - Multiplayer appear to co-occur frequently as well. This makes sense since these games are actually related to each other content-wise, not to mention how similar their titles are. The key difference between these two games is that the former has single-player mechanics while the latter is multiplayer-oriented, requiring interaction with other players. Hence, it seems that many of those who played the single-player campaign also wanted to try the multiplayer mode, and vice versa.</p>

<p>To avoid the limitation of the support-confidence framework (i.e., high support and high confidence could happen by chance), we primarily use the evaluation metric ‘lift’ to find more meaningful associations. The idea is to provide recommendations based on strongly correlated video games.</p>

<p>Sorted by highest ‘lift’, the generated rules are consistent with our earlier results. Half-Life 2 Episode One and Half-Life 2 Episode Two are part of the top list. With a huge lift of 65.07, these two games indeed have a strong, positive correlation. Moreover, it is expected that Call of Duty Modern Warfare 3 and Call of Duty Modern Warfare 3 - Multiplayer are also strongly correlated, with a lift of 41.47.</p>

<p><a id="table8"></a></p>
<h4 id="table-8-sample-of-interesting-rules">Table 8. Sample of Interesting Rules.</h4>
<table>
<thead>
<tr>
<th>A</th>
<th>B</th>
<th>A Support</th>
<th>B Support</th>
<th>support(A⇒B)</th>
<th>confidence(A⇒B)</th>
<th>lift</th>
<th>leverage</th>
<th>conviction</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>(The Elder Scrolls V Skyrim)</code></td>
<td>(Fallout 4)</td>
<td>0.047315</td>
<td>0.011794</td>
<td>0.005273</td>
<td>0.111437</td>
<td>9.448542</td>
<td>0.004715</td>
<td>1.112139</td>
</tr>
<tr>
<td><code>(The Elder Scrolls V Skyrim)</code></td>
<td>(BioShock Infinite)</td>
<td>0.047315</td>
<td>0.014847</td>
<td>0.006244</td>
<td>0.131965</td>
<td>8.888508</td>
<td>0.005541</td>
<td>1.134923</td>
</tr>
</tbody>
</table>

<p>One rule that we find particularly interesting is that The Elder Scrolls V Skyrim and Fallout 4 are highly correlated, with a lift of 9.45, as shown in Table 8. Interestingly, we found that bundles of these two games are currently being sold not only on Steam but also on the PlayStation Store. It seems that Steam and PlayStation are aware that players frequently played these two games together before, which is why these games are recently being sold as a bundle for both <a href="https://store.steampowered.com/bundle/6527/Skyrim_Special_Edition__Fallout_4_GOTY/">PC</a> and <a href="https://store.playstation.com/en-us/product/UP1003-CUSA02557_00-FO4GOTYSSEBUNDLE">PS4</a> console gaming.</p>

<p>From the same table, another interesting rule indicates that The Elder Scrolls V Skyrim and BioShock Infinite are also highly correlated, with a lift of 8.89. We found that a bundle of these two games, for PlayStation 3 this time, is actually being sold on <a href="https://www.amazon.com/Elder-Scrolls-Skyrim-Bioshock-Infinite-PlayStation/dp/B00HV0MNEI">Amazon.com</a>. Again, this means that these two games might really have a strong association, which is why they are being sold as part of a bundle.</p>

<h2 id="discussion-and-future-works">DISCUSSION AND FUTURE WORKS</h2>
<p>In this study, after trying multiple thresholds, we learned that a threshold of 0.5% provides acceptable interpretability of results. We use this support threshold to identify sets of frequently played video games: a game set is deemed frequent if it is observed in at least 0.5% of the video game dataset. Note that a frequent set should have at least one game, while the number of games in the set cannot exceed three.</p>

<p>The result of the adopted approach can be summarized as follows. We show that Dota 2 is the most frequently played game. Although not as frequent as Dota 2, Team Fortress 2 is the second most observed game. The impact of the co-occurrence of these two games is visible in the frequent 2-itemset and 3-itemset results. Considering the observation that popular games will always co-occur more than others, we adopted the strategy of using another interestingness measure, ‘all-confidence’. As a result, popular games no longer dominate the top co-occurrences, and the most common pair of co-occurring games is Half-Life 2 Episode One and Half-Life 2 Episode Two, which are obviously associated with each other as a prequel-sequel pair.</p>

<p>Furthermore, we focused on video game co-occurrence analysis of the dataset to build a video game recommendation system. With the help of the MLxtend library, we built a system that primarily uses the built-in interestingness measure ‘lift’ to recommend associated games. This system is able to identify meaningful associations from our co-occurrence analysis and provide game recommendations to the user. Interestingly, some of the association rules found correspond exactly to game pairs sold as bundles for both PC and PS4 console gaming. However, we are dealing with a huge number of games, which can produce a large number of null transactions, and the built-in interestingness measure ‘lift’ is not a null-invariant measure. This means that ‘lift’ can easily be affected by null transactions, giving the system a chance to provide the user with a bad recommendation.</p>

<p>With this in mind, in the next study we are going to focus on comparing different null-invariant measures such as ‘all-confidence’ and ‘Kulczynski’. Since there are no ready-made functions or modules that implement those measures, we need to create them and integrate them into the recommendation system. It will be interesting to see the impact of null-invariant measures, alongside non-null-invariant ones, on the association rules.</p>

<p>Moreover, another avenue of future research that we would like to explore is the predictive power of individual and co-occurring frequent video games for predicting the category of a game. This idea is based on the fact that some video games are frequent within a specific category of games.</p>

<h2 id="references">REFERENCES</h2>
<p>[1] <a id="ref1"></a>M. Snider, Video games can be a healthy social pastime during coronavirus pandemic, USA Today, March 29, 2020. [Online]. Available: <a href="https://www.usatoday.com/story/tech/gaming/2020/03/28/videogames-whos-prescription-solace-during-coronaviruspandemic/2932976001/">https://www.usatoday.com/story/tech/gaming/2020/03/28/videogames-whos-prescription-solace-during-coronaviruspandemic/2932976001/</a>. [Accessed May 7, 2020].</p>

<p>[2] <a id="ref2"></a>P. Bertens, A. Guitart, P. P. Chen and A. Perianez, A Machine-Learning Item Recommendation System for Video Games, 2018 IEEE Conference on Computational Intelligence and Games (CIG), Maastricht, 2018, pp. 1-4, doi: 10.1109/CIG.2018.8490456.</p>

<p>[3] <a id="ref3"></a>O’Neill, M., Vaziripour, E., Wu, J., Zappala, D.: Condensing steam: distilling the diversity of gamer behavior. In Proceedings of the 2016 Internet Measurement Conference, IMC 2016, pp. 81-95. ACM, New York (2016). <a href="https://doi.org/10.1145/2987443.2987489.">https://doi.org/10.1145/2987443.2987489</a>.</p>

<p>[4] <a id="ref4"></a>Tamber Team. (2017, March). Steam Video Games. Retrieved May 10, 2020 from <a href="https://www.kaggle.com/tamber/steam-video-games/">https://www.kaggle.com/tamber/steam-video-games/</a>.</p>

<p>[5] <a id="ref5"></a>D. J. Prajapati, S. Garg and N. C. Chauhan,” Interesting association rule mining with consistent and inconsistent rule detection from big sales data in distributed environment, “ Future Computing and Informatics Journal, pp. 1-12, 2017.</p>]]></content><author><name>John Ray Martinez</name></author><category term="projects" /><category term="recommender system" /><category term="video games" /><category term="association rules" /><summary type="html"><![CDATA[In this project, we create a game-based recommender system using association rules mining with respect to the video games that were frequently played together.]]></summary></entry><entry><title type="html">Applied Data Science with Python</title><link href="https://jraymartinez.github.io/portfolio/certificates/coursera/" rel="alternate" type="text/html" title="Applied Data Science with Python" /><published>2019-02-06T08:50:00+00:00</published><updated>2019-02-06T08:50:00+00:00</updated><id>https://jraymartinez.github.io/portfolio/certificates/coursera</id><content type="html" xml:base="https://jraymartinez.github.io/portfolio/certificates/coursera/"><![CDATA[<p><a href="https://umich.edu/"><em>University of Michigan</em></a> (2019).<br /></p>

<p><strong>Description</strong>. John has successfully completed a collection of Data Science courses from the University of Michigan on Coursera through a project jointly organized by <a href="https://pcieerd.dost.gov.ph/">DOST-PCIEERD</a> and <a href="https://coursebank.ph/">moocs.ph</a>. This was a 12-month online DOST-PCIEERD-sponsored Data Science program in cooperation with Coursera and moocs.ph.</p>

<p>The Applied Data Science with Python specialization has the following modules:</p>

<ol>
  <li>Using Python to Access Web Data (download <a href="https://www.coursera.org/account/accomplishments/verify/D2KTSD43RPEN">here</a>)</li>
  <li>Using Databases with Python (download <a href="https://www.coursera.org/account/accomplishments/verify/K2V4P4BY9E7A">here</a>)</li>
  <li>Introduction to Data Science in Python</li>
  <li>Applied Plotting, Charting and Data Representation in Python</li>
  <li>Applied Machine Learning in Python</li>
  <li>Applied Text Mining in Python</li>
  <li>Applied Social Network Analysis in Python</li>
</ol>

<p>The certificate for the collection of courses can be downloaded <a href="https://www.coursera.org/account/accomplishments/specialization/NXKX3LQMKGQ5">here</a>.</p>]]></content><author><name>John Ray Martinez</name></author><category term="certificates" /><category term="coursera" /><category term="python" /><category term="machine learning" /><category term="text mining" /><category term="social network" /><summary type="html"><![CDATA[Has successfully completed a collection of Data Science courses from University of Michigan in Coursera.]]></summary></entry><entry><title type="html">Effect of a Linear Potential on the Temporal Diffraction of Particle in a Box</title><link href="https://jraymartinez.github.io/portfolio/publications/spp/" rel="alternate" type="text/html" title="Effect of a Linear Potential on the Temporal Diffraction of Particle in a Box" /><published>2009-10-30T06:14:36+00:00</published><updated>2009-10-30T06:14:36+00:00</updated><id>https://jraymartinez.github.io/portfolio/publications/spp</id><content type="html" xml:base="https://jraymartinez.github.io/portfolio/publications/spp/"><![CDATA[<p><a href="https://jraymartinez.github.io/portfolio">John Ray Martinez</a> and <a href="http://quant-math.org/wp/">Eric A. Galapon</a><br />
<a href="https://spp-online.org/"><em>Samahang Pisika ng Pilipinas (SPP)</em></a> (2009): ISSN 1656-2666 Volume 6. 17.<br /></p>

<p><img src="https://jraymartinez.github.io/portfolio/assets/images/spp.png" alt="Diffraction in time for state n = 500 of an incident beam at observation point x = 1, with box width L = 1 and f = 1000." /></p>

<p><strong>Abstract.</strong> Diffraction in time of a particle initially confined in a box is studied under a linear potential. Moshinsky’s shutter problem is generalized to include new initial conditions with a linear potential, which show double temporal diffraction for each opposite-moving plane wave, with the occurrence of a reflected wave at later times. Density profiles at transient times and later times are discussed. The twofold Moshinsky diffraction in time for a high-energy state of the particle is also analyzed. The classical limit as the box width L tends to zero is illustrated.</p>

<p>The full paper can be downloaded <a href="https://jraymartinez.github.io/portfolio/assets/docs/jmartinez_spp_2009.pdf">here</a>.</p>]]></content><author><name>John Ray Martinez</name></author><category term="publications" /><category term="temporal diffraction" /><category term="linear potential" /><category term="classical limit" /><summary type="html"><![CDATA[Diffraction in time of a particle initially confined in a box is studied under linear potential.]]></summary></entry></feed>